An Introduction to Data Intensive Computing

Chapter 3: Processing Big Data

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1.  Introduction (0830-0900)
    a.  Data clouds (e.g. Hadoop)
    b.  Utility clouds (e.g. Amazon)
2.  Managing Big Data (0900-0945)
    a.  Databases
    b.  Distributed File Systems (e.g. Hadoop)
    c.  NoSQL databases (e.g. HBase)
3.  Processing Big Data (0945-1000 and 1030-1100)
    a.  Multiple Virtual Machines & Message Queues
    b.  MapReduce
    c.  Streams over distributed file systems
4.  Lab using Amazon's Elastic MapReduce (1100-1200)
  
	
  
Section 3.1
Processing Big Data Using Utility and Data Clouds

[Image: a Google production rack of servers from about 1999.]
  
•  How do you do analytics over commodity disks and processors?
•  How do you improve the efficiency of programmers?
Serial & SMP Algorithms

[Diagram: a serial algorithm runs one task against its local disk*; a Symmetric Multiprocessing (SMP) algorithm runs several tasks that share a local disk*.]

•  * local disk and memory
  
Pleasantly (= Embarrassingly) Parallel

[Diagram: several independent groups of tasks, each group working against its own local disk, coordinated with MPI.]

•  Need to partition the data, start the tasks, and collect the results.
•  Often the tasks are organized into a DAG.
  
How Do You Program A Data Center?
  
The Google Data Stack

•  The Google File System (2003)
•  MapReduce: Simplified Data Processing… (2004)
•  BigTable: A Distributed Storage System… (2006)
  
Google's Large Data Cloud

   Applications
   Compute Services     Google's MapReduce
   Data Services        Google's BigTable
   Storage Services     Google File System (GFS)

Google's early data stack, circa 2000.
Hadoop's Large Data Cloud (Open Source)

   Applications
   Compute Services     Hadoop's MapReduce
   Data Services        NoSQL, e.g. HBase
   Storage Services     Hadoop Distributed File System (HDFS)

Hadoop's stack.
A very nice recent book by Barroso and Holzle.
  
The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."

SOSP '07
  
Amazon Style Data Cloud

[Diagram: a Load Balancer in front of the Simple Queue Service, which feeds pools of EC2 Instances; SimpleDB (SDB) and the S3 Storage Services hold the data.]
Open Source Versions

•  Eucalyptus
   –  Ability to launch VMs
   –  S3-like storage
•  OpenStack
   –  Ability to launch VMs
   –  S3-like storage (Swift)
•  Cassandra
   –  Key-value store like S3
   –  Columns like BigTable
•  Many other open source Amazon-style services are available.
Some Programming Models for Data Centers

•  Operations over a data center of disks
   –  MapReduce ("string-based" scans of data)
   –  User-Defined Functions (UDFs) over the data center
   –  Launch VMs that all have access to highly scalable and available disk-based data.
   –  SQL and NoSQL over the data center
•  Operations over a data center of memory
   –  Grep over distributed memory
   –  UDFs over distributed memory
   –  Launch VMs that all have access to highly scalable and available memory-based data.
   –  SQL and NoSQL over distributed memory
Section 3.2
Processing Data By Scaling Out Virtual Machines
  
Processing Big Data Pattern 1:
Launch Independent Virtual Machines and Task Them with a Messaging Service
Task With Messaging Service & Use S3 (Variant 1)

[Diagram: a Control VM launches and tasks the worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); the worker VMs read and write their data in S3.]
  
Task With Messaging Service & Use NoSQL DB (Variant 2)

[Diagram: the same control VM / messaging service / worker VM pattern, with the workers keeping their data in AWS SimpleDB instead of S3.]
  
Task With Messaging Service & Use Clustered FS (Variant 3)

[Diagram: the same pattern, with the workers sharing data through a clustered file system such as GlusterFS.]
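All three variants share the same shape: a control process pushes task descriptions onto a queue, workers pull tasks and write their results to shared storage. Below is a minimal local sketch of that loop, using Python's standard queue and threads as stand-ins for the messaging service, the worker VMs, and the shared store; every name in it is illustrative, not an AWS API.

# Local sketch of Pattern 1: a controller tasks workers through a queue.
# queue.Queue stands in for the messaging service (SQS, AMQP, ...), the
# threads stand in for worker VMs, and shared_store for S3 / SimpleDB.
import queue
import threading

task_queue = queue.Queue()
shared_store = {}                      # stand-in for S3 / SimpleDB / GlusterFS
store_lock = threading.Lock()

def worker():
    while True:
        task = task_queue.get()
        if task is None:               # poison pill: no more work
            task_queue.task_done()
            break
        result = sum(task["data"])     # the per-task "analytics"
        with store_lock:
            shared_store[task["id"]] = result
        task_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for i in range(10):                    # the control VM tasks the workers
    task_queue.put({"id": i, "data": list(range(i + 1))})
for _ in workers:
    task_queue.put(None)
task_queue.join()
for w in workers:
    w.join()
print(shared_store)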
  
Section 3.3
MapReduce

[Image: the Google 2004 technical report.]
  
Core Concepts

•  Data are (key, value) pairs, and that's it.
•  Partition the data over commodity nodes filling racks in a data center.
•  Software handles failures, restarts, etc. This is the hard part.
•  Basic examples:
   –  Word count
   –  Inverted index
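A minimal single-process sketch of the (key, value) model follows; it illustrates the programming model only, not Hadoop's distributed, fault-tolerant implementation. The map/reduce pair below is word count.

# Toy, single-process illustration of the (key, value) model: map emits
# (key, value) pairs, a shuffle groups them by key, and reduce combines
# the values for each key.
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):         # map phase
            groups[k].append(v)                 # shuffle: group by key
    output = []
    for k in sorted(groups):                    # sort phase
        output.extend(reduce_fn(k, groups[k]))  # reduce phase
    return output

docs = [("doc1", "it was the best of times"),
        ("doc2", "it was the worst of times")]
print(run_mapreduce(docs, map_fn, reduce_fn))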
  
Processing Big Data Pattern 2: MapReduce

[Diagram: Map Tasks managed by Task Trackers read their input splits from HDFS and write intermediate (key, value) pairs to local disk; a shuffle & sort phase moves the pairs to the Reduce Tasks, which write their output back to HDFS.]
  
Example: Word Count & Inverted Index

•  How do you count the words in a million books?
   –  (best, 7)
•  Inverted index:
   –  (best; page 1, page 82, …)
   –  (worst; page 1, page 12, …)

[Image: cover of the serial, Vol. V, 1859, London.]
  
•  Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
•  What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
  
Basic Pattern: Strings

1.  Extract words from web pages in parallel.
2.  Hash and sort the words.
3.  Count (or construct an inverted index) in parallel.
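Step 3 can build an inverted index instead of counts. A minimal Hadoop-streaming-style sketch in Python, assuming input lines of the form doc_id<TAB>text; that input format is an illustrative assumption, not part of the tutorial.

# Mapper: read "doc_id<TAB>text" lines, emit one "word<TAB>doc_id" per word.
import sys

for line in sys.stdin:
    doc_id, _, text = line.rstrip('\n').partition('\t')
    for word in text.split():
        print('%s\t%s' % (word, doc_id))

# Reducer: input arrives sorted by word; emit "word<TAB>doc1,doc2,...".
import sys
from itertools import groupby
from operator import itemgetter

pairs = (line.rstrip('\n').split('\t', 1) for line in sys.stdin)
for word, group in groupby(pairs, itemgetter(0)):
    docs = sorted(set(doc for _, doc in group))
    print('%s\t%s' % (word, ','.join(docs)))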
  
What about data records?

Strings:
1.  Extract words from web pages in parallel.
2.  Hash and sort the words.
3.  Count (or construct an inverted index) in parallel.

Data records:
1.  Extract binned field values from data records in parallel.
2.  Hash and sort the binned field values.
3.  Count (or construct an inverted index) in parallel.
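The same mapper shape works for data records. A sketch assuming comma-separated records with a numeric field in column 2; the column index and the bin width are illustrative assumptions. The ordinary word-count reducer then counts records per bin.

# Mapper over data records: bin a numeric field and emit "field2=bin<TAB>1".
import sys

BIN_WIDTH = 10.0                         # illustrative bin width

for line in sys.stdin:
    fields = line.rstrip('\n').split(',')
    try:
        value = float(fields[2])         # assumed numeric field in column 2
    except (IndexError, ValueError):
        continue                         # skip malformed records
    print('field2=bin%d\t1' % int(value // BIN_WIDTH))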
  
Map-Reduce Example

•  Input is files with one document per record.
•  User specifies the map function:
   –  key = document URL
   –  value = document contents

Input of map:
   "doc cdickens two cities", "it was the best of times"

Output of map:
   "it", 1
   "was", 1
   "the", 1
   "best", 1
  
Example (cont'd)

•  The MapReduce library gathers together all pairs with the same key value (shuffle/sort phase).
•  The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
   key = "it"       values = 1, 1
   key = "was"      values = 1, 1
   key = "best"     values = 1
   key = "worst"    values = 1

Output of reduce:
   "it", 2
   "was", 2
   "best", 1
   "worst", 1
  
Why Is Word Count Important?

•  It is one of the most important examples for the type of text processing often done with MapReduce.
•  There is an important mapping:

       document   <----->   data record
       words      <----->   (field, value)

                                 Inversion
  
                  Pleasantly Parallel         MapReduce
Data structure    Arbitrary                   (key, value) pairs
Functions         Arbitrary                   Map & Reduce
Middleware        MPI (message passing)       Hadoop
Ease of use       Difficult                   Medium
Scope             Wide                        Narrow
Challenge         Getting something working   Moving to MapReduce
  
Common MapReduce Design Patterns

•  Word count
•  Inversion – inverted index
•  Computing simple statistics
•  Computing windowed statistics
•  Sparse matrix (document-term, data record-FieldBinValue, …)
•  Site-entity statistics
•  PageRank
•  Partitioned and ensemble models
•  EM
  
Section 3.4
User Defined Functions over DFS

sector.sf.net
  
Processing Big Data Pattern 3:
User Defined Functions over Distributed File Systems
  
Sector/Sphere

•  Sector/Sphere is a platform for data intensive computing.
  	
  
Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System

[Diagram: the map/shuffle stage and the reduce stage are each replaced by an arbitrary UDF.]

This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
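The idea can be imitated on a single machine: apply an arbitrary user-defined function to every file (segment) in parallel. The sketch below uses Python's multiprocessing over a local directory as a stand-in for Sphere running a UDF on the segments of a Sector file; the directory name and the example UDF are assumptions, and this is not Sector/Sphere's API.

# Apply a user-defined function to every file in a directory, in parallel.
# In Sphere the segments live in a distributed file system and the UDF runs
# on the node holding each segment; here a local process pool stands in.
import glob
from multiprocessing import Pool

def udf(path):
    # Example UDF: count the lines in one segment.
    with open(path) as f:
        return path, sum(1 for _ in f)

if __name__ == '__main__':
    segments = glob.glob('data/*.txt')        # stand-in for Sector segments
    with Pool(processes=4) as pool:
        for path, n_lines in pool.map(udf, segments):
            print('%s\t%d' % (path, n_lines))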
  
Idea 2: Add Security From the Start

[Diagram: a Security Server connected to the Master and the Client over SSL (AAA); data flows between the Client and the Slaves.]

•  The security server maintains information about users and slaves.
•  User access control: password and client IP address.
•  File level access control.
•  Messages are encrypted over SSL. A certificate is used for authentication.
•  Sector is a good basis for HIPAA compliant applications.
  
Idea 3: Extend the Stack to Include Network Transport Services

Google, Hadoop:               Sector:
   Compute Services              Compute Services
   Data Services                 Data Services
   Storage Services              Storage Services
                                 Routing & Transport Services
  
Section 3.5
Computing With Streams:
Warming Up With Means and Variances
  
Warm Up: Partitioned Means

Step 1. Compute the local (Σ xi, Σ xi², ni) in parallel for each partition.

Step 2. Compute the global mean and variance from these tuples.

•  Means and variances cannot be computed naively when the data is in distributed partitions.
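Concretely, if partition i contributes the triple (s_i, q_i, n_i) = (Σ x, Σ x², n), then Step 2 reduces to a standard identity, written here in LaTeX for reference (the second formula is the population variance):

    N = \sum_i n_i, \qquad
    \mu = \frac{1}{N}\sum_i s_i, \qquad
    \sigma^2 = \frac{1}{N}\sum_i q_i - \mu^2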
  
Trivial Observation 1

If si = Σ xi is the i'th local sum, then the global mean = Σ si / Σ ni.

•  If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean.
•  The same trick works for the variance, but you need to pass the triples (Σ xi, Σ xi², ni).
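A minimal Python sketch of the two steps, using the population-variance identity above; the partition data is made up for illustration.

# Step 1: each partition reduces its values to the triple (sum, sum_sq, n).
# Step 2: the triples are combined into the global mean and variance.

def local_triple(values):
    return sum(values), sum(x * x for x in values), len(values)

def combine(triples):
    s = sum(t[0] for t in triples)            # global sum
    q = sum(t[1] for t in triples)            # global sum of squares
    n = sum(t[2] for t in triples)            # global count
    mean = s / n
    variance = q / n - mean * mean            # population variance
    return mean, variance

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]    # made-up partitions
triples = [local_triple(p) for p in partitions]      # Step 1 (in parallel)
print(combine(triples))                              # Step 2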
  
	
  
Trivial Observation 2

•  To reduce the data passed over the network, combine the appropriate statistics as early as possible.
•  Consider the average. Recall that with MapReduce there are 4 steps (Map, Shuffle, Sort and Reduce), and that Reduce pulls its data from the local disk of the node that performed the Map.
•  A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
•  There are built-in combiners for counts, means, etc.
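One way to picture this in streaming terms is in-mapper combining: aggregate locally before anything is written for the shuffle. A sketch for per-key averages, assuming key<TAB>value input lines; the format is illustrative and this is not a built-in Hadoop combiner.

# Streaming mapper with local (in-mapper) combining for per-key averages:
# read "key<TAB>value" records, emit one "key<TAB>sum<TAB>count" line per
# key; the reducer adds the partial sums and counts and divides once.
import sys
from collections import defaultdict

partial = defaultdict(lambda: [0.0, 0])      # key -> [sum, count]

for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    partial[key][0] += float(value)
    partial[key][1] += 1

for key, (total, count) in partial.items():
    print('%s\t%s\t%d' % (key, total, count))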
  
Section 3.6
Hadoop Streams
  
Processing Big Data Pattern 4:
Streams over Distributed File Systems
  
Hadoop Streams

•  In addition to the Java API, Hadoop offers
   –  a Streaming interface for any language that supports reading from and writing to standard in and standard out
   –  Pipes for C++
•  Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to
   –  C++ libraries like Boost and the GNU Scientific Library (GSL)
   –  R modules
  
Pros and Cons

•  Java
   +  Best documented
   +  Largest community
   –  More LOC per MR job
•  Python
   +  Efficient memory handling
   +  Programmers can be very efficient
   –  Limited logging / debugging
•  R
   +  Vast collection of statistical algorithms
   –  Poor error handling and memory handling
   –  Less familiar to developers
  
Word Count Python Mapper

import sys

def read_input(file):
    # Each input line is a chunk of text; split it into words.
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Emit one tab-separated (word, 1) pair per line.
            print '%s%s%d' % (word, separator, 1)

if __name__ == '__main__':
    main()
Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    # The shuffle/sort phase delivers lines sorted by key, so consecutive
    # lines with the same word can be grouped and summed.
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == '__main__':
    main()
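For reference, a typical way to launch the mapper and reducer above as a Hadoop streaming job; the jar path and the input/output paths below are assumptions that vary by Hadoop version and installation.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  books \
    -output wordcount-out \
    -mapper  wc_mapper.py \
    -reducer wc_reducer.py \
    -file wc_mapper.py -file wc_reducer.py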
MalStone Benchmark

                            MalStone A    MalStone B
Hadoop MapReduce            455m 13s      840m 50s
Hadoop Streams (Python)      87m 29s      142m 32s
C++ implemented UDFs         33m 40s       43m 44s

Sector/Sphere 1.20, Hadoop 0.18.3 with no replication, on Phase 1 of the
Open Cloud Testbed in a single rack. The data consisted of 500 million
100-byte records per node on 20 nodes.
  
Word	
  Count	
  R	
  Mapper	
  
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)",
"", line)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn =
FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
   cat(paste(words, "t1n", sep=""), sep="")
}
close(con)




	
  
Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count



	
  
Word Count R Reducer (cont'd)

    # (Continued from the previous slide: still inside the while loop.)
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

	
  
Word Count Java Mapper

public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
	
  
Word Count Java Reducer

public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context)
        throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
 	
  
Code Comparison – Word Count Mapper

[Side-by-side comparison of the Python, R, and Java mappers shown above.]
Code Comparison – Word Count Reducer

[Side-by-side comparison of the Python, R, and Java reducers shown above.]
Questions?

For the most current version of these notes, see rgrossman.com
  

A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 

Último (20)

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Processing Big Data (Chapter 3, SC 11 Tutorial)

  • 10. Hadoop's Large Data Cloud (Open Source). Applications; Compute Services: Hadoop's MapReduce; Data Services: NoSQL, e.g. HBase; Storage Services: Hadoop Distributed File System (HDFS). Hadoop's Stack.
  • 11. A very nice recent book by Barroso and Hölzle.
  • 12. The Amazon Data Stack. "Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
  • 13. Amazon Style Data Cloud. Load Balancer; Simple Queue Service; SimpleDB (SDB); EC2 Instances; S3 Storage Services.
  • 14. Open Source Versions. Eucalyptus: ability to launch VMs; S3-like storage. OpenStack: ability to launch VMs; S3-like storage (Swift). Cassandra: key-value store like S3; columns like BigTable. Many other open source Amazon-style services are available.
  • 15. Some Programming Models for Data Centers. Operations over a data center of disks: MapReduce ("string-based" scans of data); User-Defined Functions (UDFs) over the data center; launching VMs that all have access to highly scalable and available disk-based data; SQL and NoSQL over the data center. Operations over a data center of memory: grep over distributed memory; UDFs over distributed memory; launching VMs that all have access to highly scalable and available memory-based data; SQL and NoSQL over distributed memory.
  • 16. Section 3.2. Processing Data by Scaling Out Virtual Machines.
  • 17. Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service.
  • 18. Task with Messaging Service & Use S3 (Variant 1). A control VM launches and tasks the workers through a messaging service (AWS SQS, an AMQP service, etc.); the worker VMs run the tasks and read and write S3.
  • 19. Task with Messaging Service & Use NoSQL DB (Variant 2). A control VM launches and tasks the workers through a messaging service (AWS SQS, an AMQP service, etc.); the worker VMs run the tasks and read and write AWS SimpleDB.
  • 20. Task with Messaging Service & Use Clustered FS (Variant 3). A control VM launches and tasks the workers through a messaging service (AWS SQS, an AMQP service, etc.); the worker VMs run the tasks and read and write a clustered file system such as GlusterFS.
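The queue-and-workers pattern behind Variants 1-3 can be sketched in a few lines of Python. This is an illustration rather than code from the tutorial: it assumes a present-day boto3 environment, an existing SQS queue and S3 bucket, and a made-up JSON message format that names the input and output keys. The control VM would enqueue one such message per task, and every worker VM would run a loop like this:

# Worker loop for Pattern 1 (sketch): poll a queue for task messages,
# fetch the named object from S3, process it, and write the result back.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"   # hypothetical
BUCKET = "example-task-bucket"                                              # hypothetical

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def process(data):
    # Placeholder for the real analytic task.
    return data.upper()

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])   # e.g. {"input": "in/part-0001", "output": "out/part-0001"}
        obj = s3.get_object(Bucket=BUCKET, Key=task["input"])
        result = process(obj["Body"].read())
        s3.put_object(Bucket=BUCKET, Key=task["output"], Body=result)
        # Delete the message only after the result is safely stored,
        # so a crashed worker's task is eventually redelivered.
        sqs.delete_message(QueueUrL=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]) if False else \
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Variants 2 and 3 keep the same loop and swap the S3 calls for a NoSQL store such as SimpleDB, or for reads and writes on a mounted clustered file system.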
  • 21. Section 3.3. MapReduce. Google 2004 technical report.
  • 22. Core Concepts. Data are (key, value) pairs, and that's it. Partition the data over commodity nodes filling racks in a data center. Software handles failures, restarts, etc.; this is the hard part. Basic examples: word count; inverted index.
  • 23. Processing Big Data Pattern 2: MapReduce.
  • 24. [Dataflow diagram] Map tasks, run by task trackers, read input from HDFS and write intermediate output to local disk; the intermediate data is shuffled and sorted; reduce tasks then read it and write the final output to HDFS.
  • 25. Example: Word Count & Inverted Index. How do you count the words in a million books? (best, 7). Inverted index: (best; page 1, page 82, …), (worst; page 1, page 12, …). [Image: cover of serial Vol. V, 1859, London.]
  • 26. Assume you have a cluster of 50 computers, each with an attached local disk half full of web pages. What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
  • 27. Basic Pattern: Strings. 1. Extract words from web pages in parallel. 2. Hash and sort the words. 3. Count (or construct an inverted index) in parallel.
  • 28. What about data records? For strings: 1. Extract words from web pages in parallel. 2. Hash and sort the words. 3. Count (or construct an inverted index) in parallel. For data records: 1. Extract binned field values from data records in parallel. 2. Hash and sort the binned field values. 3. Count (or construct an inverted index) in parallel.
  • 29. Map-Reduce Example. The input is a set of files with one document per record. The user specifies the map function: key = document URL, value = document contents. Input of map: ("doc cdickens two cities", "it was the best of times"). Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1).
  • 30. Example (cont'd). The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key. Input of reduce: key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1. Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1).
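To make the (key, value) flow of the last two slides concrete, here is a minimal in-memory sketch in plain Python (no Hadoop involved; the two toy documents are made up) that runs the same map, shuffle/sort, and reduce phases:

# Minimal in-memory word count showing the map, shuffle/sort, and reduce phases.
from itertools import groupby
from operator import itemgetter

documents = {
    "doc1": "it was the best of times",
    "doc2": "it was the worst of times",
}

# Map: emit (word, 1) for every word of every document.
mapped = [(word, 1) for url, text in documents.items() for word in text.split()]

# Shuffle/sort: bring pairs with the same key together.
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key.
reduced = [(word, sum(count for _, count in group))
           for word, group in groupby(mapped, key=itemgetter(0))]

print(reduced)   # [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]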
  • 31. Why Is Word Count Important? It is one of the most important examples for the type of text processing often done with MapReduce. There is an important mapping: document <-----> data record, words <-----> (field, value). Inversion.
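A sketch of that inversion with a made-up record format and hypothetical 10-year age bins: a mapper over data records emits (field=binnedValue, 1) pairs exactly where the word-count mapper emits (word, 1) pairs, so the same shuffle/sort and reduce yield counts per (field, binned value).

# Sketch: treat each data record like a "document" and emit
# "field=binnedValue<TAB>1" lines in place of "word<TAB>1" lines.
def bin_age(age):
    # Hypothetical 10-year bins, e.g. 34 -> "30-39".
    low = (age // 10) * 10
    return "%d-%d" % (low, low + 9)

records = [
    {"age": 34, "state": "IL"},
    {"age": 37, "state": "IL"},
    {"age": 52, "state": "CA"},
]

for record in records:
    print("age=%s\t1" % bin_age(record["age"]))
    print("state=%s\t1" % record["state"])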
  • 32. Pleasantly Parallel vs. MapReduce. Data structure: arbitrary vs. (key, value) pairs. Functions: arbitrary vs. Map & Reduce. Middleware: MPI (message passing) vs. Hadoop. Ease of use: difficult vs. medium. Scope: wide vs. narrow. Challenge: getting something working vs. moving to MapReduce.
  • 33. Common MapReduce Design Patterns: word count; inversion (inverted index); computing simple statistics; computing windowed statistics; sparse matrices (document-term, data record-FieldBinValue, …); site-entity statistics; PageRank; partitioned and ensemble models; EM.
  • 34. Section 3.4. User Defined Functions over a DFS. sector.sf.net
  • 35. Processing Big Data Pattern 3: User Defined Functions over Distributed File Systems.
  • 36. Sector/Sphere. Sector/Sphere is a platform for data intensive computing.
  • 37. Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System. [Diagram: a map/shuffle UDF followed by a reduce UDF applied to files.] This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
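The following is not the Sector/Sphere API, just a local sketch of the underlying pattern: apply an arbitrary user-defined function to every file in a directory in parallel (the directory name and the line-counting UDF are placeholders).

# Sketch of the UDF-over-files pattern: run a user-defined function
# over every file in a directory, in parallel across local processes.
from multiprocessing import Pool
from pathlib import Path

def udf(path):
    # Hypothetical UDF: count the lines in one file.
    with open(path, "rb") as f:
        return path, sum(1 for _ in f)

if __name__ == "__main__":
    files = [str(p) for p in Path("data").glob("*.txt")]   # "data" is a placeholder directory
    with Pool() as pool:
        for path, n_lines in pool.map(udf, files):
            print(path, n_lines)

In Sphere the same idea is applied to the files of a distributed file system rather than a local directory.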
  • 38. Idea 2: Add Security From the Start. [Diagram: client, master server, security server, and slaves; control messages over SSL, data served by the slaves.] The security server maintains information about users and slaves. User access control: password and client IP address. File-level access control. Messages are encrypted over SSL; a certificate is used for authentication. Sector is a good basis for HIPAA-compliant applications.
  • 39. Idea 3: Extend the Stack to Include Network Transport Services. Google and Hadoop stacks: Compute Services, Data Services, Storage Services. Sector stack: Compute Services, Data Services, Storage Services, plus Routing & Transport Services.
  • 40. Section 3.5. Computing With Streams: Warming Up With Means and Variances.
  • 41. Warm Up: Partitioned Means. Step 1. Compute the local (Σ xi, Σ xi², ni) in parallel for each partition. Step 2. Compute the global mean and variance from these tuples. Means and variances cannot be computed naively when the data is in distributed partitions.
  • 42. Trivial Observation 1. If si = Σ xi is the sum over the i-th partition, then the global mean = Σ si / Σ ni. If only the local means for each partition are passed (without the corresponding counts), there is not enough information to compute the global mean. The same trick works for the variance, but you need to pass the triples (Σ xi, Σ xi², ni).
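A small sketch of Observation 1 with made-up partitions: each partition contributes only its (Σ x, Σ x², n) triple, and the global mean and (population) variance are recovered from those triples alone.

# Combine per-partition (sum, sum of squares, count) triples into
# the global mean and population variance.
partitions = [
    [1.0, 2.0, 3.0],              # made-up partition data
    [10.0, 11.0],
    [4.0, 5.0, 6.0, 7.0],
]

# Step 1 (done locally on each partition): compute the triple.
triples = [(sum(xs), sum(x * x for x in xs), len(xs)) for xs in partitions]

# Step 2 (done once, over the collected triples): combine.
S = sum(t[0] for t in triples)    # global sum of x
Q = sum(t[1] for t in triples)    # global sum of x^2
N = sum(t[2] for t in triples)    # global count

mean = S / N
variance = Q / N - mean ** 2      # population variance: E[x^2] - (E[x])^2

# Check against computing directly over all of the data.
all_xs = [x for xs in partitions for x in xs]
assert abs(mean - sum(all_xs) / len(all_xs)) < 1e-12
print(mean, variance)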
  • 43. Trivial Observation 2. To reduce the data passed over the network, combine appropriate statistics as early as possible. Consider the average. Recall that with MapReduce there are four steps (Map, Shuffle, Sort, and Reduce), and Reduce pulls data from the local disks of the nodes that performed the Map. A Combine step in MapReduce combines local map output before it is pulled for the Reduce step. There are built-in combiners for counts, means, etc.
  • 44. Section 3.6. Hadoop Streams.
  • 45. Processing Big Data Pattern 4: Streams over Distributed File Systems.
  • 46. Hadoop Streams. In addition to the Java API, Hadoop offers a streaming interface for any language that can read from standard in and write to standard out, and Pipes for C++. Why would you want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to C++ libraries such as Boost and the GNU Scientific Library (GSL), and to R modules.
  • 47. Pros and Cons. Java: best documented; largest community; but more lines of code per MapReduce job. Python: efficient memory handling; programmers can be very efficient; but limited logging/debugging. R: vast collection of statistical algorithms; but poor error handling and memory handling, and less familiar to developers.
  • 48. Word Count Python Mapper:

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
  • 49. Word Count Python Reducer:

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for _, count in group)
        print("%s%s%d" % (word, sep, total_count))

if __name__ == "__main__":
    main()
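Assuming the two scripts above are saved as mapper.py and reducer.py and that sample.txt is a small local test file (all three names are hypothetical), the streaming contract can be exercised without a cluster by replaying the map, sort, and reduce steps through a shell pipeline:

# Replay the streaming steps locally: map, shuffle/sort, then reduce.
# mapper.py, reducer.py, and sample.txt are hypothetical local files.
import subprocess

pipeline = "cat sample.txt | python mapper.py | sort -k1,1 | python reducer.py"
result = subprocess.run(pipeline, shell=True, capture_output=True, text=True, check=True)
print(result.stdout)

On a real cluster the same scripts are handed to the Hadoop streaming jar (for example via its -input, -output, -mapper, and -reducer options), and Hadoop performs the shuffle and sort between them.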
  • 50. MalStone Benchmark. Hadoop MapReduce: MalStone A 455m 13s, MalStone B 840m 50s. Hadoop Streams (Python): MalStone A 87m 29s, MalStone B 142m 32s. C++-implemented UDFs: MalStone A 33m 40s, MalStone B 43m 44s. Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack; 20 nodes with 500 million 100-byte records per node.
  • 51. Word Count R Mapper:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# Split a line into words on runs of spaces.
splitIntoWords <- function(line) unlist(strsplit(line, " +"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
  • 52. Word Count R Reducer:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
  • 53. Word Count R Reducer (cont'd):

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
  • 54. Word Count Java Mapper:

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
  • 55. Word Count Java Reducer:

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
  • 56. Code Comparison – Word Count Mapper: the Python, Java, and R mappers from slides 48, 54, and 51 shown side by side.
  • 57. Code Comparison – Word Count Reducer: the Python, Java, and R reducers from slides 49, 55, and 52–53 shown side by side.
  • 58. Questions? For the most current version of these notes, see rgrossman.com