Beyond Map/Reduce: Getting Creative with Parallel Processing

Ed Kohlwey
@ekohlwey
kohlwey_edmund@bah.com

Overview
•  Within the last year:
  –  Two cluster schedulers have been released
  –  Two BSP frameworks have been released
  –  An in-memory Map/Reduce has been
     released
  –  Accumulo has been released
•  More importantly
  –  We have been given the tools to program in
     something besides Map/Reduce and MPI
What About…
•  This talk covers a few specific frameworks
•  There are many more out there
Motivations for Schedulers

The cornerstone of new cluster computing environments
Different Tasks Have Different Needs
[Figure: CPU and RAM needs of Task A, Task B, and Task C, each drawn against a stack of hosts]
Clusters Often Don't Accommodate This
[Figure: two charts for Task A, Task B, and Task C, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1 only.]
This Is How It Should Look
[Figure: the same two charts for Task A, Task B, and Task C, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1 and Type 2.]
Economic Reasons
[Figure: power consumption plotted against load]
Simple Example: a Work Queue
•  Data scientists execute serial
   implementations of machine learning
   algorithms
•  Some are expensive, some are not
•  Scientists aren’t running analyses all the time
•  Solution 1:
   –  Give all the analysts a big workstation
•  Solution 2:
   –  Give the analysts all thin clients and let them
      share a cluster
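Solution 2 can be sketched as a shared pool of workers pulling jobs from one queue. The following is a minimal single-process sketch, not a cluster scheduler: a thread pool stands in for the cluster hosts, and the class name `SharedWorkQueue`, the `runAnalysis` stand-in, and the job costs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: analysts submit serial jobs to one shared pool
// instead of each running them on a private workstation.
final class SharedWorkQueue {

    // Stand-in for a serial analysis; "cost" is the number of loop steps.
    static long runAnalysis(int cost) {
        long acc = 0;
        for (int i = 1; i <= cost; i++) {
            acc += i;
        }
        return acc;
    }

    // Submits every job to a fixed pool of workers and returns the results
    // in submission order. Jobs wait in the pool's queue while all workers
    // are busy.
    static List<Long> runAll(List<Integer> jobCosts, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (int cost : jobCosts) {
                Callable<Long> job = () -> runAnalysis(cost);
                pending.add(pool.submit(job));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("analysis failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Some jobs are expensive, some are not; two workers serve them all.
        System.out.println(runAll(Arrays.asList(10, 1_000_000, 100), 2));
    }
}
```

Adding capacity here means raising `workers`, which mirrors the point that one extra host benefits every analyst sharing the queue.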
Advantages of Moving to a Thin Client/Cluster Model
•  Scalability
  –  All analysts' capabilities can be enhanced by
     adding one host
•  Increased resource utilization
  –  Workstations are expensive, and will be
     highly under-utilized
•  Increased availability
  –  Using a distributed file system to store data
Desirable Scheduler Features
Feature                                                     YARN   Mesos
Operate on heterogeneous clusters                           Y      Y
Highly Available                                            Y      Y
Pluggable scheduling policies                               Y      Y
Authentication                                              Y      N
Task artifact distribution                                  Y      P
Scheduling policy based on multiple resources (RAM, CPU)    N      Y
Multiple Queues                                             Y      N
Fast accept/reject model                                    N      P
Reusable method of describing resource requirements         Y      N
Pluggable Isolation                                         N      Y
“Compute Units”                                             N      N
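To make the "scheduling policy based on multiple resources (RAM, CPU)" row concrete, here is a minimal first-fit placement sketch over a heterogeneous cluster. It is an illustration only and is not how YARN or Mesos schedule; `MultiResourcePlacement`, `Host`, and the host sizes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of multi-resource, first-fit task placement.
// Not the YARN or Mesos algorithm.
final class MultiResourcePlacement {

    // A host with its remaining CPU cores and RAM (in GB).
    static final class Host {
        final String name;
        int freeCpu;
        int freeRamGb;

        Host(String name, int cpu, int ramGb) {
            this.name = name;
            this.freeCpu = cpu;
            this.freeRamGb = ramGb;
        }

        // A task fits only if BOTH resources are available.
        boolean fits(int cpu, int ramGb) {
            return cpu <= freeCpu && ramGb <= freeRamGb;
        }
    }

    // Returns the name of the first host that can hold the task and
    // reserves the resources on it, or null if the task is rejected.
    static String place(List<Host> hosts, int cpu, int ramGb) {
        for (Host h : hosts) {
            if (h.fits(cpu, ramGb)) {
                h.freeCpu -= cpu;
                h.freeRamGb -= ramGb;
                return h.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Host> cluster = new ArrayList<>(Arrays.asList(
            new Host("cpu-heavy", 16, 8),
            new Host("ram-heavy", 4, 64)));
        System.out.println(place(cluster, 8, 4));   // prints cpu-heavy
        System.out.println(place(cluster, 2, 32));  // prints ram-heavy
        System.out.println(place(cluster, 12, 4));  // prints null: no host has 12 free cores
    }
}
```

A policy that looked at CPU alone would try to put the RAM-bound task on the first host with two free cores, which is the mismatch the earlier cluster-load slides describe.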
  
New Compute Environments

BSP, In-Memory Map/Reduce, and Streaming Processing
(Hadoop) Map/Reduce Pros & Cons
•  Map/Reduce implements partitioned,
   parallel sorting
  –  Many algorithms (such as relational ones) express well
  –  Creates O(n lg(n)) runtime constraints for
     some problems that wouldn’t otherwise have
     them
•  Hadoop M/R is good for bulk jobs
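The sort at the heart of Map/Reduce can be seen in a single-process sketch of the three phases. This is an illustration under stated assumptions, not Hadoop's implementation: `MiniMapReduce` is a hypothetical name, and a `TreeMap` stands in for the distributed sort and shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of map -> sort/shuffle -> reduce.
final class MiniMapReduce {

    // Map phase: split each line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Sort/shuffle phase: group values by key in sorted key order.
    // Each TreeMap insertion costs O(lg n), so grouping n pairs is O(n lg n).
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the values of every key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> groups) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) {
                sum += v;
            }
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c");
        System.out.println(reduce(shuffle(map(lines))));  // prints {a=2, b=2, c=1}
    }
}
```

Word counting is one of the problems that would not otherwise need a sort: grouping with a hash table takes expected O(n) time, so the O(n lg n) bound comes from the framework rather than from the problem.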
In-Memory Map/Reduce
•  Memory is fast
•  Often, after the map phase, a whole data
   set can fit in the memory of the cluster
•  Spark provides this, as well as a very
   succinct programming environment
   courtesy of Scala and its closures
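Why keeping the data set in memory matters for iterative algorithms can be sketched without any framework. The example below is a hedged illustration, not Spark code: `InMemoryIterations` is a hypothetical class, parsing an array of strings stands in for reading input from disk, and a one-parameter gradient descent stands in for a real model.

```java
// Hypothetical sketch: an iterative job that re-reads its input on every
// pass versus one that parses it once and keeps it in memory.
final class InMemoryIterations {

    // Counts how often the raw input is parsed (a stand-in for disk reads).
    static int parseCalls = 0;

    // Stand-in for reading and parsing the input.
    static double[] load(String[] rawLines) {
        parseCalls++;
        double[] xs = new double[rawLines.length];
        for (int i = 0; i < rawLines.length; i++) {
            xs[i] = Double.parseDouble(rawLines[i]);
        }
        return xs;
    }

    // One gradient-descent step for minimizing the mean of (w - x)^2 / 2.
    static double step(double w, double[] xs, double rate) {
        double grad = 0.0;
        for (double x : xs) {
            grad += (w - x);
        }
        return w - rate * grad / xs.length;
    }

    // Disk-style: every iteration loads its input again.
    static double reloadEveryIteration(String[] raw, int iterations, double rate) {
        double w = 0.0;
        for (int i = 0; i < iterations; i++) {
            w = step(w, load(raw), rate);
        }
        return w;
    }

    // Memory-style: load once, then iterate over the cached array.
    static double loadOnce(String[] raw, int iterations, double rate) {
        double[] cached = load(raw);
        double w = 0.0;
        for (int i = 0; i < iterations; i++) {
            w = step(w, cached, rate);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] raw = {"1.0", "2.0", "3.0"};
        parseCalls = 0;
        double a = reloadEveryIteration(raw, 20, 0.5);
        int loadsA = parseCalls;
        parseCalls = 0;
        double b = loadOnce(raw, 20, 0.5);
        int loadsB = parseCalls;
        System.out.println(a + " after " + loadsA + " loads, "
            + b + " after " + loadsB + " load");
    }
}
```

Both variants compute the same value; only the number of loads differs, and that gap grows with the iteration count.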
In-Memory Performance
[Figure: Logistic Regression Performance Comparison. Time (s), 0 to 4000, against iterations (5, 10, 20, 30) for Hadoop and Spark.]
*Numbers taken from http://spark-project.org
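Spark Wordcount
// `spark` is the SparkContext; the HDFS path is elided on the original slide
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))   // split each line into words
    .map(word => (word, 1))             // emit a (word, 1) pair per word
    .reduceByKey(_ + _)                 // sum the counts for each word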
Hadoop Wordcount
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  // Mapper: emits (word, 1) for every token of the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts of each word; also used as the combiner.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");  // newer Hadoop: Job.getInstance(conf, "word count")
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Streaming Processing: Accumulo
•  Accumulo is a BigTable implementation
•  Idea: accumulate values in a column
    –  “map” using the ETL process
•  Summarize values (stored in sorted order) at read-time
    –  “reduce” process
•  No control over partitioning outside a row
    –  Accumulo doesn’t suffer from the column family problem that HBase
       has, so this is ok
•  Less consistent than Map/Reduce because race conditions can
   occur with respect to the scan cursor
•  Iterator programming environment allows you to compose “reduce”
   operations
•  Implementing streaming Map/Reduce over a BigTable
   implementation is a hybrid of in-memory and disk based
   approaches
•  Allows revision of figures due to data provenance issues
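The "accumulate on write, summarize on read" idea can be sketched with an in-memory sorted map. This is explicitly not the Accumulo API: `ReadTimeSummary`, `write`, and `scanRow` are hypothetical names, nested `TreeMap`s stand in for the sorted table, and a `BinaryOperator` stands in for a composable "reduce" step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical sketch of "accumulate on write, summarize on read".
// Nested TreeMaps stand in for a sorted BigTable-style store.
final class ReadTimeSummary {

    // row -> column -> every value written to that cell, kept in sorted key order.
    private final TreeMap<String, TreeMap<String, List<Long>>> rows = new TreeMap<>();

    // "map" side: the ETL process appends a value; nothing is aggregated yet.
    void write(String row, String column, long value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new ArrayList<>())
            .add(value);
    }

    // "reduce" side: a scan of one row folds each column's values with the
    // combiner at read time. The summary never leaves the row.
    TreeMap<String, Long> scanRow(String row, BinaryOperator<Long> combiner) {
        TreeMap<String, Long> summary = new TreeMap<>();
        TreeMap<String, List<Long>> columns = rows.get(row);
        if (columns == null) {
            return summary;
        }
        for (Map.Entry<String, List<Long>> e : columns.entrySet()) {
            Long acc = null;
            for (Long v : e.getValue()) {
                acc = (acc == null) ? v : combiner.apply(acc, v);
            }
            summary.put(e.getKey(), acc);
        }
        return summary;
    }

    public static void main(String[] args) {
        ReadTimeSummary table = new ReadTimeSummary();
        table.write("page1", "hits", 3);
        table.write("page1", "hits", 4);
        table.write("page1", "bytes", 100);
        System.out.println(table.scanRow("page1", Long::sum));  // prints {bytes=100, hits=7}

        // A late-arriving value revises the figure on the next read.
        table.write("page1", "hits", 5);
        System.out.println(table.scanRow("page1", Long::sum));  // prints {bytes=100, hits=12}
    }
}
```

Because the raw values are kept and only folded at scan time, a different combiner (for example `Math::max` wrapped as a `BinaryOperator<Long>`) can be applied to the same data without rewriting it.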
BSP (Bulk Synchronous Parallel)

Generalizing Map/Reduce for graph processing
BSP
•  First proposed by Valiant in 1990
•  Good at expressing iterative computation
•  Good at expressing graph algorithms
•  Concerned with passing messages
   between virtual processors
•  Perhaps the most famous implementation
   is Pregel
MR Graph Traversal
[Figure, built up over several slides: vertices A, B, and C pass through Map, Sort + Shuffle, and Reduce. A announces "I want to send a message to C!"]
•  Map: A emits its own record (A_n) and a message keyed to C (C_m); B and C emit only their own records (B_n, C_n)
•  Sort + Shuffle: records are grouped by key, so C_n and C_m arrive together
•  Reduce: each vertex writes its record (A_n, B_n, C_n); the message is delivered ("I got it!")
•  Cost: O((n+m) lg(n+m))
•  This can be optimized to O(m)
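One round of the traversal above can be sketched in plain Java to show where the n vertex records and m messages meet in the sort. This is a single-process illustration, not Hadoop code: `MrTraversalRound`, `Rec`, and the use of hop distance as the vertex state are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of ONE Map/Reduce round of graph traversal.
final class MrTraversalRound {

    static final int UNREACHED = Integer.MAX_VALUE;

    // Either a vertex's own state (edges != null) or a message to it (edges == null).
    static final class Rec {
        final String key;
        final int distance;
        final List<String> edges;

        Rec(String key, int distance, List<String> edges) {
            this.key = key;
            this.distance = distance;
            this.edges = edges;
        }

        boolean isMessage() {
            return edges == null;
        }
    }

    // Map: re-emit every vertex record (the whole graph state is copied),
    // plus one message per outgoing edge of a reached vertex.
    static List<Rec> map(List<Rec> vertices) {
        List<Rec> out = new ArrayList<>();
        for (Rec v : vertices) {
            out.add(v);
            if (v.distance != UNREACHED) {
                for (String target : v.edges) {
                    out.add(new Rec(target, v.distance + 1, null));
                }
            }
        }
        return out;
    }

    // Sort + shuffle, then reduce: sorting all n + m records by key is the
    // O((n+m) lg(n+m)) step; the reduce is one pass over each run of equal keys.
    static List<Rec> sortAndReduce(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Rec r) -> r.key));
        List<Rec> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).key;
            int best = UNREACHED;
            List<String> edges = Collections.emptyList();
            while (i < sorted.size() && sorted.get(i).key.equals(key)) {
                Rec r = sorted.get(i++);
                best = Math.min(best, r.distance);
                if (!r.isMessage()) {
                    edges = r.edges;  // keep the vertex's adjacency list
                }
            }
            out.add(new Rec(key, best, edges));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> graph = Arrays.asList(
            new Rec("A", 0, Arrays.asList("C")),
            new Rec("B", UNREACHED, Collections.<String>emptyList()),
            new Rec("C", UNREACHED, Arrays.asList("B")));
        for (Rec r : sortAndReduce(map(graph))) {
            String d = (r.distance == UNREACHED) ? "unreached" : String.valueOf(r.distance);
            System.out.println(r.key + " " + d);
        }
    }
}
```

Note that every vertex record travels through the sort on every round, even when only one message (A to C) was actually sent.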
The BSP Version
[Figure, built up over several slides: vertices A, B, and C pass through Compute, Exchange Messages, and Synchronize. A sends a message to C (C_m) and C sends a message to B (B_m); each vertex ends the superstep holding its own state (A_n, B_n, C_n).]
•  Notice A and C's message exchange isn't closely coupled, providing better I/O utilization
•  Also, notice we don't necessarily have to copy the entire graph state. We just send whatever messages need to be sent
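The compute, exchange, synchronize cycle can be sketched as a loop over supersteps. This is a single-process illustration in the spirit of the slides, not Giraph or Pregel code: `BspTraversal` is a hypothetical class, hop distance is an assumed vertex state, and every vertex (including message targets) is assumed to appear as a key in `edges`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of BSP supersteps for graph traversal.
final class BspTraversal {

    static final int UNREACHED = Integer.MAX_VALUE;

    // Runs supersteps until no messages are in flight and returns the hop
    // distance of every vertex from the source.
    static Map<String, Integer> run(Map<String, List<String>> edges, String source) {
        // Vertex state stays in place for the whole computation.
        Map<String, Integer> distance = new TreeMap<>();
        for (String v : edges.keySet()) {
            distance.put(v, UNREACHED);
        }

        // Messages delivered at the start of the current superstep.
        Map<String, Integer> inbox = new HashMap<>();
        inbox.put(source, 0);

        while (!inbox.isEmpty()) {
            Map<String, Integer> outbox = new HashMap<>();
            // Compute: only vertices that received a message do any work.
            for (Map.Entry<String, Integer> msg : inbox.entrySet()) {
                String v = msg.getKey();
                int d = msg.getValue();
                if (d < distance.get(v)) {
                    distance.put(v, d);
                    // Exchange messages: send only what changed, not the graph.
                    for (String target : edges.get(v)) {
                        outbox.merge(target, d + 1, Integer::min);
                    }
                }
            }
            // Synchronize: the barrier. Sent messages become visible only
            // in the next superstep.
            inbox = outbox;
        }
        return distance;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("A", Arrays.asList("C"));
        edges.put("B", Collections.<String>emptyList());
        edges.put("C", Arrays.asList("B"));
        System.out.println(run(edges, "A"));  // prints {A=0, B=2, C=1}
    }
}
```

The work per superstep is proportional to the messages in flight rather than to the size of the graph, which is the point the slide makes about not copying the entire graph state.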
BSP Implementations
•  Giraph
  –  Currently an Apache Incubator project
  –  Has a growing community
  –  Runs during the Hadoop Map phase
•  GoldenOrb
  –  Not actively maintained since the summer
•  Both implementations are in-memory,
   modeled after Pregel
Contact Info
Ed Kohlwey
Booz | Allen | Hamilton
@ekohlwey
kohlwey_edmund@bah.com

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Beyond Map/Reduce: Getting Creative With Parallel Processing

  • 1. Beyond Map/Reduce: Getting Creative with Parallel Processing
      Ed Kohlwey, @ekohlwey, kohlwey_edmund@bah.com
  • 2. Overview
      • Within the last year:
          – Two cluster schedulers have been released
          – Two BSP frameworks have been released
          – An in-memory Map/Reduce has been released
          – Accumulo has been released
      • More importantly:
          – We have been given the tools to program in something besides Map/Reduce and MPI
  • 3. What About…
      • This talk covers a few specific frameworks
      • There’s lots more out there
  • 4. Motivations for Schedulers
      The cornerstone of new cluster computing environments
  • 5. Different Tasks Have Different Needs
      [Figure: CPU and RAM needs of Task A, Task B, and Task C, each shown with a stack of hosts.]
  • 6. Clusters Often Don’t Accommodate This
      [Figure: two charts over Task A, Task B, and Task C: "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1.]
  • 7. This is How It Should Look
      [Figure: the same two charts, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1, Type 2.]
  • 9. Simple Example: a Work Queue
      • Data scientists execute serial implementations of machine learning algorithms
      • Some are expensive, some are not
      • Scientists aren’t running analyses all the time
      • Solution 1:
          – Give all the analysts a big workstation
      • Solution 2:
          – Give all the analysts thin clients and let them share a cluster
  • 10. Advantages of Moving to a Thin Client/Cluster Model
      • Scalability
          – All analyst capabilities can be enhanced by adding one host
      • Increases resource utilization
          – Workstations are expensive, and will be highly under-utilized
      • Increases availability
          – Using a distributed file system to store data
  • 11. Desirable Scheduler Features
      Feature                                                    YARN   Mesos
      Operate on heterogeneous clusters                          Y      Y
      Highly Available                                           Y      Y
      Pluggable scheduling policies                              Y      Y
      Authentication                                             Y      N
      Task artifact distribution                                 Y      P
      Scheduling policy based on multiple resources (RAM, CPU)   N      Y
      Multiple Queues                                            Y      N
      Fast accept/reject model                                   N      P
      Reusable method of describing resource requirements        Y      N
      Pluggable Isolation                                        N      Y
      “Compute Units”                                            N      N
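To make the "scheduling policy based on multiple resources (RAM, CPU)" row concrete, here is a minimal sketch of a placement policy for a heterogeneous cluster. It does not use the YARN or Mesos APIs; the `Host` and `Task` types, the `costPerHour` field, and the "cheapest host that fits" rule are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration only; not the YARN or Mesos API.
final class CheapestFitScheduler {
    static final class Host {
        final String name;
        final int freeCpus;
        final int freeRamGb;
        final int costPerHour;

        Host(String name, int freeCpus, int freeRamGb, int costPerHour) {
            this.name = name;
            this.freeCpus = freeCpus;
            this.freeRamGb = freeRamGb;
            this.costPerHour = costPerHour;
        }
    }

    static final class Task {
        final String name;
        final int cpus;
        final int ramGb;

        Task(String name, int cpus, int ramGb) {
            this.name = name;
            this.cpus = cpus;
            this.ramGb = ramGb;
        }
    }

    /** Returns the cheapest host with enough free CPU and RAM, or null to reject the task. */
    static Host place(Task task, List<Host> hosts) {
        Host best = null;
        for (Host h : hosts) {
            // Both resources must fit; checking only one would overcommit the other.
            boolean fits = h.freeCpus >= task.cpus && h.freeRamGb >= task.ramGb;
            if (fits && (best == null || h.costPerHour < best.costPerHour)) {
                best = h;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Host> hosts = Arrays.asList(
                new Host("type1", 4, 8, 1),
                new Host("type2", 32, 256, 10));
        System.out.println(place(new Task("A", 2, 4), hosts).name);    // small task
        System.out.println(place(new Task("B", 16, 128), hosts).name); // large task
        System.out.println(place(new Task("C", 64, 512), hosts));      // fits nowhere
    }
}
```

With two host types, the small task lands on the cheap host and only the large task occupies the expensive one, which is the situation slide 7 argues for.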
  • 12. New Compute Environments
      BSP, In-Memory Map/Reduce, and Streaming Processing
  • 13. (Hadoop) Map/Reduce Pros & Cons
      • Map/Reduce implements partitioned, parallel sorting
          – Many algorithms (relational) express well
          – Creates O(n lg(n)) runtime constraints for some problems that wouldn’t otherwise have them
      • Hadoop M/R is good for bulk jobs
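The point that Map/Reduce is built on sorting can be sketched on a single machine. The following is an illustrative stand-in, not Hadoop code: the class name `SortBasedWordCount` is hypothetical, and the in-memory sort only plays the role of the distributed shuffle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-machine illustration of map, sort, and reduce; not the Hadoop API.
final class SortBasedWordCount {
    static Map<String, Integer> count(List<String> lines) {
        // Map phase: emit one key per word (the value 1 is implicit).
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    keys.add(word);
                }
            }
        }

        // Shuffle: sorting brings equal keys together. This is the O(n lg n) step.
        Collections.sort(keys);

        // Reduce phase: one linear pass, summing each run of equal keys.
        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < keys.size()) {
            String key = keys.get(i);
            int sum = 0;
            while (i < keys.size() && keys.get(i).equals(key)) {
                sum++;
                i++;
            }
            counts.put(key, sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("b a");
        lines.add("a c a");
        System.out.println(count(lines));
    }
}
```

Counting words with a hash table would take expected linear time, so the sort here is a cost imposed by the model rather than by the problem, which is the constraint the slide refers to.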
  • 14. In-Memory Map/Reduce
      • Memory is fast
      • Often, after the map phase, a whole data set can fit in the memory of the cluster
      • Spark provides this, as well as a very succinct programming environment courtesy of Scala and its closures
  • 15. In-Memory Performance
      [Chart: Logistic Regression Performance Comparison. Time (s) against iterations (5, 10, 20, 30) for Hadoop and Spark.]
      *Numbers taken from http://spark-project.org
  • 16. Spark Wordcount
      val file = spark.textFile("hdfs://...")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
  • 17. Hadoop Wordcount
      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.GenericOptionsParser;

      public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
             extends Mapper<Object, Text, Text, IntWritable>{

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(Object key, Text value, Context context
                          ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
            }
          }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class IntSumReducer
             extends Reducer<Text,IntWritable,Text,IntWritable> {
          private IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values,
                             Context context
                             ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        // Driver: configures the job and waits for it to finish.
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
          if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
          }
          Job job = new Job(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }
  • 18. Streaming Processing: Accumulo
      • Accumulo is a BigTable implementation
      • Idea: accumulate values in a column
          – “map” using the ETL process
      • Summarize values (stored in sorted order) at read-time
          – “reduce” process
      • No control over partitioning outside a row
          – Accumulo doesn’t suffer from the column family problem that HBase has, so this is ok
      • Less consistent than Map/Reduce because race conditions can occur with respect to the scan cursor
      • Iterator programming environment allows you to compose “reduce” operations
      • Implementing streaming Map/Reduce over a BigTable implementation is a hybrid of in-memory and disk based approaches
      • Allows revision of figures due to data provenance issues
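The "accumulate at write time, summarize at read time" idea can be sketched without any Accumulo code. The class below is a hypothetical in-memory stand-in: a sorted map plays the role of the table, and the `reducer` argument plays the role of a "reduce" operation applied during a scan. None of these names come from the Accumulo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// In-memory stand-in for a sorted key-value table; not the Accumulo API.
final class ReadTimeSummary {
    // Key is "row:column"; the list holds every value written so far.
    private final TreeMap<String, List<Long>> cells = new TreeMap<>();

    // "Map" side: ingest appends a value without reading the previous one.
    void write(String row, String column, long value) {
        cells.computeIfAbsent(row + ":" + column, k -> new ArrayList<>()).add(value);
    }

    // "Reduce" side: fold the stored values of each cell while scanning in key order.
    Map<String, Long> scan(BinaryOperator<Long> reducer) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : cells.entrySet()) {
            // Every list has at least one element, so the Optional is never empty.
            out.put(e.getKey(), e.getValue().stream().reduce(reducer).get());
        }
        return out;
    }

    public static void main(String[] args) {
        ReadTimeSummary table = new ReadTimeSummary();
        table.write("page1", "hits", 1);
        table.write("page1", "hits", 1);
        table.write("page2", "hits", 5);
        System.out.println(table.scan(Long::sum));
        System.out.println(table.scan(Long::max));
    }
}
```

Because the raw values stay in the table, the same data can be summarized again with a different operation, which is one way to read the slide's remark about revising figures.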
  • 20. BSP
      • First proposed by Valiant in 1990
      • Good at expressing iterative computation
      • Good at expressing graph algorithms
      • Concerned with passing messages between virtual processors
      • Perhaps the most famous implementation is Pregel
  • 21. MR Graph Traversal
      [Diagram: Map, Sort + Shuffle, and Reduce stages. Vertices A, B, and C each enter the map stage with state n.]
  • 22. MR Graph Traversal
      [Same diagram, with the callout "I want to send a message to C!"]
  • 23. MR Graph Traversal
      [Diagram: map output. A emits (A, n) and (C, m); B emits (B, n); C emits (C, n).]
  • 24. MR Graph Traversal
      [Same diagram as slide 23.]
  • 25. MR Graph Traversal
      [Diagram: after sort + shuffle the records are grouped by key. A: n; B: n; C: n, m.]
  • 26. MR Graph Traversal
      [Diagram: reduce output. A, B, and C each emit their state n. Callout at C: "I got it!"]
  • 27. MR Graph Traversal
      [Same diagram, annotated with the cost O((n+m) lg(n+m)).]
  • 28. MR Graph Traversal
      [Same diagram, annotated "This can be optimized to O(m)".]
  • 29. The BSP Version
      [Diagram: Compute, Exchange Messages, and Synchronize stages. A holds state n and sends message m to C; B and C hold state n.]
  • 30. The BSP Version
      [Same diagram.] Notice A and C’s message exchange isn’t closely coupled, providing better I/O utilization
  • 31. The BSP Version
      [Same diagram.] Also, notice we don’t necessarily have to copy the entire graph state. We just send whatever messages need to be sent
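The compute, exchange, and synchronize cycle can be sketched as a single-threaded simulation. The example below propagates the maximum vertex value through a graph, a common introductory vertex-centric algorithm; it is not the API of Pregel, Giraph, or GoldenOrb, and the class and method names are hypothetical. It assumes every edge points at a vertex that has an initial value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-threaded simulation of BSP supersteps; not a real BSP framework API.
final class BspSketch {
    /** Spreads the largest vertex value to every vertex reachable from it. */
    static Map<String, Integer> propagateMax(Map<String, Integer> initial,
                                             Map<String, List<String>> edges) {
        Map<String, Integer> value = new TreeMap<>(initial);

        // Superstep 0: every vertex announces its value to its neighbours.
        Map<String, List<Integer>> inbox = new HashMap<>();
        for (String v : value.keySet()) {
            send(inbox, edges, v, value.get(v));
        }

        while (!inbox.isEmpty()) {
            Map<String, List<Integer>> nextInbox = new HashMap<>();
            // Compute: each vertex with mail looks only at its own state and messages.
            for (Map.Entry<String, List<Integer>> e : inbox.entrySet()) {
                String v = e.getKey();
                int best = value.get(v);
                for (int m : e.getValue()) {
                    best = Math.max(best, m);
                }
                if (best > value.get(v)) {
                    value.put(v, best);
                    // Exchange: only changed vertices send, so graph state is never copied wholesale.
                    send(nextInbox, edges, v, best);
                }
            }
            // Synchronize: messages become visible only in the next superstep.
            inbox = nextInbox;
        }
        return value;
    }

    private static void send(Map<String, List<Integer>> inbox,
                             Map<String, List<String>> edges, String from, int msg) {
        for (String to : edges.getOrDefault(from, Collections.<String>emptyList())) {
            inbox.computeIfAbsent(to, k -> new ArrayList<>()).add(msg);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> initial = new HashMap<>();
        initial.put("A", 3);
        initial.put("B", 6);
        initial.put("C", 2);
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("A", Collections.singletonList("B"));
        edges.put("B", new ArrayList<>(List.of("A", "C")));
        edges.put("C", Collections.singletonList("B"));
        System.out.println(propagateMax(initial, edges));
    }
}
```

Unlike the Map/Reduce version, no sort is involved and a vertex that has nothing new to say sends nothing, so the loop ends as soon as a superstep produces no messages.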
  • 32. BSP Implementations
      • Giraph
          – Currently an Apache Incubator project
          – Has a growing community
          – Runs during the Hadoop Map phase
      • GoldenOrb
          – Not actively maintained since the summer
      • Both implementations are in-memory, modeled after Pregel
  • 33. Contact Info
      Ed Kohlwey, Booz | Allen | Hamilton, @ekohlwey, kohlwey_edmund@bah.com