Parallel Computing for Econometricians with Amazon Web Services

Stephen J. Barr

University of Rochester

March 2, 2011




The Old Way

(image slide)
The New Way

(image slide)
Table of Contents
   Tools Overview

   Hadoop

   Amazon Web Services

   A Simple EMR and R Example
      The R code - mapper
      Resources List

   segue and a SML Example
      Simulated Maximum Likelihood Example
      multicore - on the way to segue
      diving into segue

   Other EC2 Software Options

   Conclusion
Algorithms and Implementations



      “Stupidly parallel” (embarrassingly parallel) - e.g. a for loop where each
      iteration is independent.
          Only 1 computer? (need 1-8 cores) - use the R multicore
          package on a single EC2 node.
          Need more? Use Hadoop / MapReduce - can do complicated
          mapping and aggregation, in addition to the stupidly parallel
          stuff.
      MapReduce - use Hadoop directly (Java), Hadoop Streaming
      (any programming language), or the rhipe R package (R on
      Hadoop).
In this presentation, we will be using Hadoop either directly
through Elastic MapReduce or indirectly via the segue package for R.
Alternatives




       Wait a long time
       Use multiple cores, e.g. with multicore's mclapply (see the sketch below):
       http://www.rforge.net/doc/packages/multicore/mclapply.html
       Take over the computer lab and start jobs by hand
       Buy your own cluster (huge initial cost, and it will sit unutilized
       most of the time)




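A minimal sketch of the multicore route mentioned above (assuming the multicore
package, which this deck uses later; the toy function and input list are made
up for illustration):

library(multicore)

# mclapply() is a drop-in replacement for lapply() that fans the work
# out across the cores of a single machine.
slow.task <- function(i) { Sys.sleep(1); i^2 }   # stand-in for real per-item work
res <- mclapply(as.list(1:8), slow.task)
unlist(res)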
What is it?




       Hadoop is made by the Apache Software Foundation, which
       makes open source software. Contributors to the foundation
       are both large companies and individuals.
       Hadoop Common: The common utilities that support the
       other Hadoop subprojects.
       HDFS: A distributed file system that provides high throughput
       access to application data.
       MapReduce: A software framework for distributed processing
       of large data sets on compute clusters.
       Often, when people say “Hadoop” they mean Hadoop’s
       implementation of the MapReduce algorithm.
       The algorithm was created by Google and is documented here:
       http://labs.google.com/papers/mapreduce.html
What is it for?




       Used to process many TB of webserver logs for metrics, targeted
       ad placement, etc.
       Users include:
           Google - calculating PageRank, processing traffic, etc.
           Yahoo - more than 100,000 CPUs in various clusters, including a 4,000
           node cluster. Used for ad placement, etc.
           LinkedIn - huge social network graphs - “you may know...”
           Amazon - creating product search indices
       See: http://wiki.apache.org/hadoop/PoweredBy
MapReduce Example – Word Count

(Diagram: input documents flow into the Map phase, where each mapper emits
key/value pairs such as “This”, Doc1 and “Word”, Doc1; a sort step groups the
pairs by key; the Reduce phase then aggregates each group, producing the output
counts “This”, 3 and “Word”, 2.)
Algorithm



   The idea is that the job is broken into map and reduce steps.
       Mappers process the input and create chunks
       Reducers aggregate the chunks
   Hadoop provides a Java implementation of this algorithm.
   Features include fault tolerance, adding nodes on the fly, extreme
   speed, and more.
   Hadoop Streaming allows mappers and reducers to be written in any language,
   communicating over <STDIN> and <STDOUT>. A sketch of a streaming
   word-count mapper follows below.




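As a hedged illustration of the word-count example above (not from the original
deck), here is what a streaming mapper might look like in R. It assumes Hadoop
Streaming is run with the built-in aggregate reducer, whose LongValueSum: key
prefix tells it to sum the emitted counts per word:

#! /usr/bin/env Rscript
# Word-count mapper for Hadoop Streaming (illustrative sketch).
# Reads raw text on stdin and emits "LongValueSum:<word> <tab> 1" lines.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  words <- words[nchar(words) > 0]
  for (w in words) cat("LongValueSum:", w, "\t1\n", sep = "")
}
close(con)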
Hadoop Performance Statistics


      Hadoop is FAST! See the results of the 2010 sort benchmark competition,
      http://sortbenchmark.org/
What is this cloud?




      Cloud computing is the idea of abstracting away from the physical
      hardware
      All data and computing resources are managed services
      Pay per hour, based on need
AWS Overview
  Get ready for some acronyms! Amazon Web Services (AWS) is full
  of them. The relevant ones are:
      EC2 - Elastic Compute Cloud - dynamically get N computers
      for a few cents per hour. Instances range from micro
      instances ($0.02/hr) to 8-core, 70GB RAM “quad-xl”
      instances ($2.00/hr) to GPU machines ($2.10/hr).
      EMR - Elastic MapReduce - automates the instantiation of
      Hadoop jobs. Builds the cluster and runs the job, completely in
      the background.
      S3 - Simple Storage Service - store VERY large objects in the
      cloud.
      RDS - Relational Database Service - a managed MySQL
      database. An easy way to store data and later load it into
      R with the RMySQL package (see the sketch below). E.g.
      select date,price from myTable where TICKER=’AMZN’
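A hedged sketch of the RDS-to-R round trip (not from the deck): the endpoint,
database name, credentials, and table below are placeholders.

library(RMySQL)

# Connect to a hypothetical RDS MySQL endpoint
con <- dbConnect(MySQL(),
                 host     = "mydb.abc123.us-east-1.rds.amazonaws.com",
                 dbname   = "research",
                 user     = "analyst",
                 password = "secret")

# Pull the AMZN price series straight into a data frame
prices <- dbGetQuery(con, "SELECT date, price FROM myTable WHERE TICKER = 'AMZN'")
dbDisconnect(con)
head(prices)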
AWS Links




     EC2 - http://aws.amazon.com/ec2/
     EMR - http://aws.amazon.com/elasticmapreduce/
        Getting started guide - http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
     S3 - http://aws.amazon.com/s3/
Steps




    1. Write the mapper in R. The output will be aggregated by
       Hadoop’s aggregate function.
    2. Create the input files
    3. Upload everything to S3
    4. Configure the EMR job in the AWS Management Console
    5. Done!
Files


   The directory emr.simpleExample/simpleSimRmapper contains
   the following:
        makeData.R generates 1000 CSV files with 1,000,000 rows and 4
        columns each. Each file is about 76 MB (a sketch follows below).
        fileSplit.sh takes a directory of input files and prepares
        them for use with EMR (more on this later)
        sjb.simpleMapper.R takes the name of a file from the
        command line, gets it from S3, runs a regression, and hands back
        the coefficients. These coefficients are then aggregated using
        aggregate, a standard Hadoop reducer




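A hedged sketch of what makeData.R might do (the real script is on the author’s
site): the column names and the data-generating process below are assumptions,
chosen to match the regression mapper sketched later.

# Write 1000 CSV files of 1,000,000 rows x 4 columns each
# (roughly the 76 MB per file mentioned on the slide).
set.seed(1)
n.files <- 1000
n.rows  <- 1e6
for (i in 1:n.files) {
  x1 <- rnorm(n.rows); x2 <- rnorm(n.rows); x3 <- rnorm(n.rows)
  y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n.rows)   # assumed DGP
  write.csv(data.frame(y, x1, x2, x3),
            file = sprintf("simdata_%04d.csv", i), row.names = FALSE)
}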
Mapper functions




      INPUT: <STDIN>. This can be
          A seed to a random number generator
          Raw data text to process
          A list of file names to process - we are doing this one.
      OUTPUT: <STDOUT> (print it!), which next goes to the
      reducer.




General R Mapper Code Outline



trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)

    # process and print results
}
close(con)
Simple Mapper




   File: sjb.simpleMapper.R. Algorithm (a sketch follows below):
       get the file from S3
       read it
       run a regression
       print results in a way that aggregate can read




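A hedged sketch of what sjb.simpleMapper.R might look like (the real script is
on the author’s site). The bucket name, the use of s3cmd to fetch objects, the
column names, and the DoubleValueSum: aggregate prefix are all assumptions.

#! /usr/bin/env Rscript
# Each stdin line names a CSV object to fetch from S3, regress, and summarise.
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(key <- readLines(con, n = 1, warn = FALSE)) > 0) {
  key <- trimWhiteSpace(key)
  local.file <- basename(key)
  system(paste("s3cmd get s3://my-bucket/", key, " ", local.file, sep = ""))  # hypothetical bucket
  dat <- read.csv(local.file)
  fit <- lm(y ~ x1 + x2 + x3, data = dat)
  # One line per coefficient; the aggregate reducer sums values sharing a key.
  for (nm in names(coef(fit))) {
    cat("DoubleValueSum:", nm, "\t", coef(fit)[[nm]], "\n", sep = "")
  }
}
close(con)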
Let’s run it!
Overview



    1. Made some data with makeData.R
    2. Used fileSplit.sh to make lists of files to grab from S3.
       These lists will be fed into the mapper. Then transferred the
       data and lists to S3. See moveToS3.sh for a list of
       commands, but don’t try to run this directly.
    3. sjb.simpleMapper.R reads lines. Each line is a file. It opens
       the file, does some work, and prints some output.
    4. Configured the job on EMR using the AWS Management Console,
       using the standard aggregator to aggregate the results.
Numbers



  Consider this: in less than 10 minutes, we
      Instantiated a cluster of 13 m2.4xlarge instances (68.4 GB RAM, 8 cores
      each)
      Installed the Linux OS and Hadoop software on all nodes
      Distributed approx. 20 GB of data to the nodes
      Ran some analysis in R
      Aggregated the results
      Shut down the cluster
Useful Links




      Good EMR R Discussion
      Hadoop on EMR with C# and F#
      Hadoop Aggregate
Description




   From the segue homepage, http://code.google.com/p/segue/ :
       Segue has a simple goal: Parallel functionality in R; two
       lines of code; in under 15 minutes.
                                                         J.D. Long
AWS API - what segue is built on



      API stands for Application Programming Interface
      All Amazon Web Services have APIs, which allow
      programmatic access. This exposes many more features than
      the AWS Management Console
      For example, through the API one can start and stop a cluster
      without adding jobs, add nodes to a running cluster, etc.
      Using the API, you can write programs that treat clusters as
      native objects
      segue is such a program
segue usage




      Segue is ideal for CPU-bound applications - e.g. simulations
      It replaces lapply, which applies a function to the elements of a
      list, with emrlapply, which distributes the evaluation of the
      function to a cluster via Elastic MapReduce
      The list can be anything - seeds for a random number
      generator, matrices to invert, data frames to analyse, etc.
code overview




   Note: Code available on my website, http://econsteve.com/r.
   Showing 3 levels of optimization:
       For loops to matrices
       Evaluating firms on multiple cores
       Evaluating firms on multiple computers on EC2
Simulated MLE



  We use the simulated log-likelihood

      \ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left[ \frac{1}{R} \sum_{r=1}^{R}
                         \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta_{u_i^r}) \right]

  where i indexes a person among N people (or a firm in a set of firms), R is
  the number of simulation draws, with R ∝ √N, and T_i is the length of the
  data for firm i.
With for loops - R pseudocode

   panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
     logLik <- 0
     uir <- qnorm(seedMatrix)
     for (n in 1:N) {
       LiR <- 0
       for (r in 1:R) {
         myProduct <- 1
         alpha.r <- mu.a + uir[r, (2 * n) - 1] * sigma.a
         beta.r  <- mu.b + uir[r, (2 * n)] * sigma.b
         for (t in 1:T) {
           # fi = density of the residual for period t, using Y, THETA
           myProduct <- myProduct * fi
         }
         LiR <- LiR + myProduct
       } # end for r in R
       Li <- LiR / R               # simulated likelihood for firm n
       logLik <- logLik + log(Li)
     } # end for n
     return(logLik)
   }
With for loops - R pseudocode




   We then maximize the likelihood function:

   optimRes <- optim(THETA.init1, panelLogLik.simple, ...)

   This is extremely slow on one processor, and it does not lend itself to
   parallelization. (30 min for 60 firms - didn’t bother to test more.)
Opt 1 - matrices, lists, lapply




   We adopt a new approach with the following rules:
       Structure the data as a list of lists, where each sublist contains
       the data, ticker symbol, and uir for the relevant coefficients
       Make a firm (i ∈ N) likelihood function, and an outer panel
       likelihood function which sums the results of the firms




Opt 1 - matrices, lists, lapply - firm likelihood
   # this should be an extremely fast firm likelihood function
   firmLikelihood <- function(dataListItem, THETA, R) {
     sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
     mu.b    <- THETA[4]; sigma.b <- THETA[5]

     data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
     T <- nrow(data.n)

     uirAlpha <- dataListItem$UIRALPHA
     uirBeta  <- dataListItem$UIRBETA

     alpha.rmat <- mu.a + uirAlpha * sigma.a
     beta.rmat  <- mu.b + uirBeta * sigma.b
     YtStack <- repmat(Y.n, R, 1)
     XtStack <- repmat(X.n, R, 1)
     residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
     fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
     myProductVec <- apply(fitMat, 1, prod)
     Li2 <- sum(myProductVec) / R
     return(Li2)
   }
The list-based outer loop

   panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
     # the seed matrix has R rows and 2*N columns, where there are N firms
     # and 2 parameters of interest (alpha and beta)
     uir <- qnorm(seedMatrix)
     R <- nrow(seedMatrix)

     # notice that we can calculate the likelihoods independently for
     # each firm, so we can make a function and use lapply. This will be
     # useful for parallelization

     firmLik <- lapply(dataList, firmLikelihood, THETA, R)
     logLik <- sum(log(unlist(firmLik)))
     return(logLik)
   }
The list-based outer loop - multicore




   Use the R multicore library, and replace lapply with mclapply at
   the outer loop.

   library(multicore)
   ...
   firmLik <- mclapply(dataList, firmLikelihood, THETA, R)

   This will lead to some substantial speedups.
multicore

   N: 200 R: 150 T: 80 logLike: -34951.8. On a 4-core laptop:
   > proc.time()
      user  system elapsed
   389.180  36.960 125.674

   N: 1000 R: 320 T: 80 logLike: -174621.9. On an EC2 2XL:
   > proc.time()
      user  system elapsed
   2705.77 2686.08  417.74

   N: 5000 R: 710 T: 80 logLike: -870744.4
   > proc.time()
        user    system   elapsed
   16206.480 16067.150  2768.588

   multicore can provide quick and easy parallelization. Write the
   program so that the parallel part is an operation on a list, then
   replace lapply with mclapply.
Bad

(image slide)

Good

(image slide)
multicore is nice for optimizing a local job.
Most machines today have at least 2 cores. Many have 4 or 8.
However, that is still only 1 machine. Let’s use n of them →




installing segue




   Install the prerequisite packages rJava and caTools. On Ubuntu Linux:

   sudo apt-get install r-cran-rjava r-cran-catools

   Then, download and install segue from
   http://code.google.com/p/segue/
Using segue


   Now in R we do:

   > library(segue)

   As we will be using our AWS account, we are going to need to set
   credentials so that other people can’t launch clusters in our name.
   To get our credentials, go to
   http://aws.amazon.com/account/ and click “Security
   Credentials”.
   Go back into R:

   setCredentials("ABC123",
                  "REALLY+LONG+12312312+STRING+456456")
Firing up the cluster in segue



   Use the createCluster command:

   createCluster(numInstances=2, cranPackages, filesOnNodes,
                 rObjectsOnNodes, enableDebugging=FALSE, instancesPerNode,
                 masterInstanceType="m1.small", slaveInstanceType="m1.small",
                 location="us-east-1a", ec2KeyName, copy.image=FALSE,
                 otherBootstrapActions, sourcePackagesToInstall)

   In our case, let’s fire up 10 m2.4xlarge instances. This gives us 80 cores
   and 684 GB of RAM to play with (see the sketch below).




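A hedged sketch of that call (the argument names follow the signature above;
whether the remaining arguments can be omitted, and the stopCluster() shutdown
call, are assumptions about segue’s defaults):

library(segue)

# Start an EMR cluster of 10 m2.4xlarge nodes (80 cores, 684 GB RAM in total)
myCluster <- createCluster(numInstances       = 10,
                           masterInstanceType = "m2.4xlarge",
                           slaveInstanceType  = "m2.4xlarge")

# ... run emrlapply() jobs against myCluster (see the following slides) ...

stopCluster(myCluster)   # shut the cluster down when finished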
parallel random number generation


  > myList <- NULL
  > set.seed(1)
  > for (i in 1:10) {
      a <- c(rnorm(999), NA)
      myList[[i]] <- a
    }

  > outputLocal <- lapply(myList, mean, na.rm = T)
  > outputEmr   <- emrlapply(myCluster, myList, mean, na.rm = T)
  > all.equal(outputEmr, outputLocal)
  [1] TRUE

   segue handles this for you. This is very important for simulation.
Monte Carlo π estimation


  estimatePi <- function(seed) {
    set.seed(seed)
    numDraws <- 1e6
    r <- .5   # radius... in case the unit circle is too boring
    x <- runif(numDraws, min = -r, max = r)
    y <- runif(numDraws, min = -r, max = r)
    inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
    return(sum(inCircle) / length(inCircle) * 4)
  }
  seedList <- as.list(1:100)
  require(segue)
  myEstimates <- emrlapply(myCluster, seedList, estimatePi)
  myPi <- Reduce(sum, myEstimates) / length(myEstimates)
  > format(myPi, digits = 10)
  [1] "3.14166556"
parallel MLE




   Using code from sml.segue.R on my website. It is exactly the
   same as the multicore example, but with the addition of two lines to
   start the cluster (a sketch of the change follows below).




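A hedged sketch of the segue version of the change (the actual sml.segue.R is
on the author’s site); myCluster is the EMR cluster started with
createCluster() on the earlier slide:

library(segue)

# was: firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
firmLik <- emrlapply(myCluster, dataList, firmLikelihood, THETA, R)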
EC2 has GPUs


  Cluster GPU Quadruple Extra Large Instance
      22 GB of memory
      33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem
      architecture)
      2 x NVIDIA Tesla Fermi M2050 GPUs
      1690 GB of instance storage
      64-bit platform
      I/O Performance: Very High (10 Gigabit Ethernet)
      API name: cg1.4xlarge
  The Fermi chips are important because they have ECC memory, so
  simulations are accurate. These are much more robust than gamer
  GPUs - they cost $2,800 per card, and each machine has two. You can
  use one for $2.10 per hour.
RHIPE




         RHIPE = R and Hadoop Integrated Processing Environment
         http://www.stat.purdue.edu/~sguha/rhipe/
         Implements an rhlapply function
         Exposes much more of Hadoop’s underlying functionality,
         including HDFS, so it may be better suited to large-data
         applications
StarCluster I

        Allows instantiation of generic clusters on EC2
        Use MPI (Message Passing Interface) for much more
        complicated parallel programs, e.g. holding one giant matrix
        across the RAM of several nodes
        From their page:
            Simple configuration with sensible defaults
            Single ”start” command to automatically launch and
            configure one or more clusters on EC2
            Support for attaching and NFS-sharing Amazon Elastic Block
            Storage (EBS) volumes for persistent storage across a cluster
            Comes with a publicly available Amazon Machine Image
            (AMI) configured for scientific computing
            The AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and
            other useful libraries
StarCluster II



       Clusters are automatically configured with NFS, Sun Grid
       Engine queuing system, and password-less ssh between
       machines
       Supports user-contributed ”plugins” that allow users to
       perform additional setup routines on the cluster after
       StarCluster’s defaults
       http://web.mit.edu/stardev/cluster/




Matlab


          You can do it in theory, but you need either a license manager
          or the MATLAB Compiler
          It will cost you.
          Whitepaper from MathWorks:
          http://www.mathworks.com/programs/techkits/ec2_paper.html
          You may be able to coax EMR into running a compiled MATLAB
          script, but you would have to bootstrap each machine to have the
          libraries required to run compiled MATLAB applications
          MathWorks has no incentive to support this behaviour
          Requires toolboxes ($$$).
EC2 and Hadoop are Extremely Powerful




       Huge and active communities behind both Hadoop (Apache)
       and EC2 (Amazon).
       EC2, and AWS in general, lets you change the way you think
       about computing resources: as a service rather than as
       devices to manage.
       New AWS features are always being added.
AWS in Education



  AMAZON WILL GIVE YOU MONEY
      Researcher - send them your proposal, they send you credits,
      you thank them in the paper.
      Teacher - if you are teaching a class, each student gets a $100
      credit, good for one year. This would be great for teaching
      econometrics, where you can provide a machine image with
      software and data already available.
      Additionally, you can use AWS for your backups (S3) and other
      tech needs.
Resources




      My website, http://www.econsteve.com/r, for the code in
      this presentation
      AWS Management Console
      http://aws.amazon.com/console/
      AWS Blog http://aws.typepad.com
      AWS in Education http://aws.amazon.com/education/

Mais conteúdo relacionado

Mais procurados

Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Reading and writing spatial data for the non-spatial programmer
Reading and writing spatial data for the non-spatial programmerReading and writing spatial data for the non-spatial programmer
Reading and writing spatial data for the non-spatial programmerChad Cooper
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 

Mais procurados (20)

Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Unit 2
Unit 2Unit 2
Unit 2
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Reading and writing spatial data for the non-spatial programmer
Reading and writing spatial data for the non-spatial programmerReading and writing spatial data for the non-spatial programmer
Reading and writing spatial data for the non-spatial programmer
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 

Destaque

Танино хобби
Танино хоббиТанино хобби
Танино хоббиAkuJIa
 
Halftime summary Cormac McGrath
Halftime summary Cormac McGrathHalftime summary Cormac McGrath
Halftime summary Cormac McGrathCormac McGrath
 
диагностический блок
диагностический блокдиагностический блок
диагностический блокAkuJIa
 
114_Ahmed_SEA_in_Planning_Process
114_Ahmed_SEA_in_Planning_Process114_Ahmed_SEA_in_Planning_Process
114_Ahmed_SEA_in_Planning_Processnaziazakir
 
Tarea equipo!
Tarea equipo!Tarea equipo!
Tarea equipo!nevarez1
 
Declive de una_civilizacion
Declive de una_civilizacionDeclive de una_civilizacion
Declive de una_civilizacionEmersson Curup
 
Filters-talent 21
Filters-talent 21Filters-talent 21
Filters-talent 21rodriquezv
 
The great singapore 3 d2n 1
The great singapore 3 d2n 1The great singapore 3 d2n 1
The great singapore 3 d2n 1Suky Naka
 
Uttarkhand relief updates 1st spetember 2013 1
Uttarkhand relief updates 1st spetember 2013 1Uttarkhand relief updates 1st spetember 2013 1
Uttarkhand relief updates 1st spetember 2013 1dexterousdoc
 
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012Elise Beyst
 

Destaque (20)

Lalalalalallalaa
LalalalalallalaaLalalalalallalaa
Lalalalalallalaa
 
Researchmethods
ResearchmethodsResearchmethods
Researchmethods
 
Танино хобби
Танино хоббиТанино хобби
Танино хобби
 
NLR PACE Program
NLR PACE ProgramNLR PACE Program
NLR PACE Program
 
Halftime summary Cormac McGrath
Halftime summary Cormac McGrathHalftime summary Cormac McGrath
Halftime summary Cormac McGrath
 
Swot powerpoint
Swot powerpointSwot powerpoint
Swot powerpoint
 
presentation group 5
presentation group 5presentation group 5
presentation group 5
 
диагностический блок
диагностический блокдиагностический блок
диагностический блок
 
114_Ahmed_SEA_in_Planning_Process
114_Ahmed_SEA_in_Planning_Process114_Ahmed_SEA_in_Planning_Process
114_Ahmed_SEA_in_Planning_Process
 
Portafolio
PortafolioPortafolio
Portafolio
 
Introduktion til slideshare net
Introduktion til slideshare netIntroduktion til slideshare net
Introduktion til slideshare net
 
Tarea equipo!
Tarea equipo!Tarea equipo!
Tarea equipo!
 
Chakala belan
Chakala belanChakala belan
Chakala belan
 
Declive de una_civilizacion
Declive de una_civilizacionDeclive de una_civilizacion
Declive de una_civilizacion
 
Filters-talent 21
Filters-talent 21Filters-talent 21
Filters-talent 21
 
The great singapore 3 d2n 1
The great singapore 3 d2n 1The great singapore 3 d2n 1
The great singapore 3 d2n 1
 
소셜Pr
소셜Pr소셜Pr
소셜Pr
 
Uttarkhand relief updates 1st spetember 2013 1
Uttarkhand relief updates 1st spetember 2013 1Uttarkhand relief updates 1st spetember 2013 1
Uttarkhand relief updates 1st spetember 2013 1
 
Ali Saruhan
Ali SaruhanAli Saruhan
Ali Saruhan
 
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012
Formation chancellerie une gestion sensible au genre_mtimmerman_jan 2012
 

Semelhante a Parallel Computing for Econometricians with Amazon Web Services

Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMRABC Talks
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 

Semelhante a Parallel Computing for Econometricians with Amazon Web Services (20)

Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Data Science
Data ScienceData Science
Data Science
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Scala+data
Scala+dataScala+data
Scala+data
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 

Último

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptxJonalynLegaspi2
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleCeline George
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 

Último (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptx
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 

Parallel Computing for Econometricians with Amazon Web Services

  • 1. Parallel Computing for Econometricians with Amazon Web Services Stephen J. Barr University of Rochester March 2, 2011 . . . . . .
  • 2. The Old Way . . . . . .
  • 3. . .
  • 4. The New Way . . . . . .
  • 5. . .
  • 6. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
  • 7. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
  • 8. Algorithms and Implementations “Stupidly parallel” - e.g. a for loop where each iteration is independent. Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node. Need more? Use Hadoop / MapReduce - can do complicated mapping and aggregation, in addition to the stupidly parallel stuff MapReduce - use Hadoop directly (Java), Hadoop Streaming (any programming language), rhipe R package (R on Hadoop). . . . . . .
  • 9. In this presentation, we will be using Hadoop either directly through Elastic MapReduce or indirectly via the Segue package for R . . . . . .
  • 10. Alternatives Wait a long time Use multicores, eg. http://www.rforge.net/doc/packages/multicore/mclapply.html Take over the computer lab and start jobs by hand Buy your own cluster (huge initial cost and will be unutilized most of the time) . . . . . .
  • 11. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
  • 12. What is it? Hadoop is made by the Apache Software Foundation, which makes open source software. Contributors to the foundation are both large companies and individuals. Hadoop Common: The common utilities that support the other Hadoop subprojects. HDFS: A distributed file system that provides high throughput access to application data. MapReduce: A software framework for distributed processing of large data sets on compute clusters. Often, when people say “Hadoop” they mean Hadoop’s implementation of the map reduce algorithm. Algorithm made by google. Documented here: http://labs.google.com/papers/mapreduce.html . . . . . .
  • 13. What is it for? Used to process many TB of webserver logs for metrics, target ad placement, etc Users include: Google - calculating pagerank, processing traffic, etc. Yahoo - > 100,000 CPUs in various clusters, including a 4,000 node cluster. Used for ad placement, etc. LinkedIn - huge social network graphs - “you may know...” Amazon - creating product search indices See: http://wiki.apache.org/hadoop/PoweredBy . . . . . .
  • 14. MAPR EDUCE EXAMPLE – W ORD COUNT Input Output “This”, 3 . “Word”, 2 Map Phase Reduce “This”, Doc1 Phase “This”, Doc1 Mapper “This”, Doc2 Reducer “Word”, Doc1 Sort “This”, Doc3 Mapper “This”, Doc2 “This”, Doc3 Mapper “Word”, Doc1 “Word”, Doc3 Reducer “Word”, Doc3 .
  • 15. Algorithm The idea is that the job is broken into map and reduce steps. Mapper processes input and creates chunks Reducer aggregates the chunks Hadoop provides a Java implementation of this algorithm. Features include fault-tolerance, adding nodes on the fly, extreme speed, and more. Hadoop is implemented in Java, and Hadoop Streaming allows mapper and reducers over any language, communicating over <STDIN>, <STDOUT>. . . . . . .
  • 16. Hadoop Performance Statistics Hadoop is FAST! From 2010 Competition, http://sortbenchmark.org/ . . . . . .
  • 17. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 18. What is this cloud? Cloud computing is the idea of abstracting away from hardware. All data and computing resources are managed services. Pay per hour, based on need.
• 19. AWS Overview Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are: EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Computers range from micro instances ($0.02/hr) to 8-core, 68.4 GB RAM “quad-XL” instances ($2.00/hr) to GPU machines ($2.10/hr). EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs; it builds the cluster and runs the job, completely in the background. S3 - Simple Storage Service - store VERY large objects in the cloud. RDS - Relational Database Service - a managed MySQL database. Easy way to store data and later load it into R with the RMySQL package, e.g. select date,price from myTable where TICKER='AMZN'
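For example, pulling that query into R with RMySQL might look like the following sketch (the host name, credentials, database, and table are placeholders, not a real RDS instance):

    library(RMySQL)
    con <- dbConnect(MySQL(),
                     host     = "mydb.abc123.us-east-1.rds.amazonaws.com",
                     user     = "analyst",
                     password = "secret",
                     dbname   = "marketdata")
    prices <- dbGetQuery(con, "select date, price from myTable where TICKER = 'AMZN'")
    dbDisconnect(con)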
  • 20. AWS Links EC2 - http://aws.amazon.com/ec2/ EMR - http://aws.amazon.com/elasticmapreduce/ Getting started guide - http://docs.amazonwebservices. com/ElasticMapReduce/latest/GettingStartedGuide/ S3 - http://aws.amazon.com/s3/ . . . . . .
  • 21. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 22. Steps 1. Write mapper in R. The output will be aggregated by Hadoop’s aggregate function. 2. Create input files 3. Upload all to S3 4. Configure EMR job in AWS Management Console 5. Done!
• 23. Files The directory emr.simpleExample/simpleSimRmapper contains the following: makeData.R generates 1000 csv files with 1,000,000 rows and 4 columns each; each file is about 76 MB. fileSplit.sh takes a directory of input files and prepares them for use with EMR (more on this later). sjb.simpleMapper.R takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
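A data generator in the spirit of makeData.R could be as simple as the sketch below; this is not the actual script, and the column names and coefficients are made up:

    # write 1000 csv files, each with 1,000,000 rows and 4 columns
    set.seed(42)
    for (f in 1:1000) {
      n  <- 1e6
      x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
      y  <- 1 + 2 * x1 - 0.5 * x2 + 0.25 * x3 + rnorm(n)
      write.csv(data.frame(y, x1, x2, x3),
                file = sprintf("simdata_%04d.csv", f),
                row.names = FALSE)
    }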
  • 24. Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
  • 25. Mapper functions INPUT: <STDIN>. This can be A seed to a random number generator Raw data text to process A list of file names to process - we are doing this one. OUTPUT: <STDOUT> (print it!), which next goes to the reducer. . . . . . .
• 26. General R Mapper Code Outline

    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimWhiteSpace(line)

      # process and print results
    }
    close(con)
• 27. Simple Mapper file: sjb.simpleMapper.R Algorithm: get the file from S3, read it, run a regression, and print the results in a way that aggregate can read.
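A sketch of a mapper along these lines is shown below. It is not the actual sjb.simpleMapper.R: the column names, the s3cmd call, and the DoubleValueSum aggregate prefix are assumptions made for illustration.

    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      s3path <- gsub("(^ +)|( +$)", "", line)     # each input line names one S3 file
      local  <- basename(s3path)
      system(paste("s3cmd get", s3path, local))   # assumes s3cmd is configured on the node
      d   <- read.csv(local)
      fit <- lm(y ~ x1 + x2 + x3, data = d)       # assumed column names
      for (nm in names(coef(fit))) {              # one output line per coefficient
        cat("DoubleValueSum:", nm, "\t", coef(fit)[[nm]], "\n", sep = "")
      }
    }
    close(con)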
• 28. Let's run it!
  • 29. Overview 1. Made some data with makeData.R 2. Used fileSplit.sh to make lists of files to grab from s3. These lists will be fed into the mapper. Then transferred the data and lists to s3. See moveToS3.sh for a list of commands, but don’t try to run this directly. 3. sjb.simpleMapper.R reads lines. Each line is a file. Opens the file, does some work, prints some output. 4. Configure job on EMR using AWS Management Console. Using the standard aggregator to aggregate results. . . . . . .
• 30. Numbers Consider this: in less than 10 minutes we Instantiated a cluster of 13 m2.4xlarge instances (68.4 GB RAM, 8 cores each) Installed the Linux OS and Hadoop software on all nodes Distributed approx. 20 GB of data to the nodes Ran some analysis in R Aggregated the results Shut down the cluster
  • 31. Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 32. Useful Links Good EMR R Discussion; Hadoop on EMR with C# and F#; Hadoop Aggregate
  • 33. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 34. Description From the project website: “Segue has a simple goal: Parallel functionality in R; two lines of code; in under 15 minutes.” - J.D. Long From the segue homepage: http://code.google.com/p/segue/
• 35. AWS API - what underlies segue API stands for Application Programming Interface. All Amazon Web Services have APIs, which allow programmatic access. This exposes many more features than the AWS Management Console. For example, through the API one can start and stop a cluster without adding jobs, add nodes to a running cluster, etc. Using the API, you can write programs that treat clusters as native objects. segue is such a program.
• 36. segue usage Segue is ideal for CPU-bound applications, e.g. simulations. It replaces lapply, which applies a function to the elements of a list, with emrlapply, which distributes the evaluation of the function to a cluster via Elastic MapReduce. The list can be anything: seeds for a random number generator, matrices to invert, data frames to analyse, etc.
  • 37. Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 38. code overview Note: code available on my website, http://econsteve.com/r. Showing 3 levels of optimization: for loops to matrices; evaluating firms on multiple cores; evaluating firms on multiple computers on EC2.
• 39. Simulated MLE We use the simulator

    \ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left[ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_{ir}) \right]

where i indexes a person among N people (or a firm in a set of N firms), R is the number of simulation draws, with R ∝ √N, and T_i is the length of the data for firm i.
• 40. With for loops - R pseudocode

    # pseudocode: N, R, T, mu.a, sigma.a, mu.b, sigma.b and the density fi
    # are assumed to be defined from THETA and the data
    panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
      logLik <- 0
      uir <- qnorm(seedMatrix)
      for (n in 1:N) {
        LiR <- 0
        for (r in 1:R) {
          myProduct <- 1
          alpha.r <- mu.a + uir[r, (2 * n) - 1] * sigma.a
          beta.r  <- mu.b + uir[r, (2 * n)] * sigma.b
          for (t in 1:T) {
            # fi = density of the residual, using Y and THETA
            myProduct <- myProduct * fi
          }
          LiR <- LiR + myProduct
        } # end for r in R
        Li <- LiR / R
        logLik <- logLik + log(Li)
      } # end for n
      return(logLik)
    }
• 41. With for loops - R pseudocode We then maximize the likelihood function as:

    optimRes <- optim(THETA.init1, panelLogLik.simple, ...)

This is extremely slow on one processor, and does not lend itself to parallelization. (30 min for 60 firms - didn't bother to test more.)
  • 42. Opt 1 - matrices, lists, lapply We adopt a new approach with the following rules: Structure the data as a list of lists, where each sublist contains the data, ticker symbol, and uir for the relevant coefficients Make a firm (i ∈ N) likelihood function, and an outer panel likelihood function which sums the results of the firms . . . . . .
• 43. Opt 1 - matrices, lists, lapply - firm likelihood

    # this should be an extremely fast firm likelihood function
    firmLikelihood <- function(dataListItem, THETA, R) {
      sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
      mu.b <- THETA[4]; sigma.b <- THETA[5]
      data.n <- dataListItem$DATA
      X.n <- data.n$X; Y.n <- data.n$Y; T <- nrow(data.n)
      uirAlpha <- dataListItem$UIRALPHA
      uirBeta  <- dataListItem$UIRBETA
      alpha.rmat <- mu.a + uirAlpha * sigma.a
      beta.rmat  <- mu.b + uirBeta * sigma.b
      YtStack <- repmat(Y.n, R, 1)   # repmat: MATLAB-style tiling, e.g. from the 'matlab' package
      XtStack <- repmat(X.n, R, 1)
      residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
      fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
      myProductVec <- apply(fitMat, 1, prod)
      Li2 <- sum(myProductVec) / R
      return(Li2)
    }
• 44. The list-based outer loop

    panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
      # the seed matrix has R rows and 2*N columns, where there are N firms
      # and 2 parameters of interest (alpha and beta)
      uir <- qnorm(seedMatrix)
      R <- nrow(seedMatrix)
      # notice that we can calculate the likelihoods independently for
      # each firm, so we can make a function and use lapply. This will be
      # useful for parallelization
      firmLik <- lapply(dataList, firmLikelihood, THETA, R)
      logLik <- sum(log(unlist(firmLik)))
      return(logLik)
    }
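To tie this back to the estimation step, the faster likelihood can be handed to optim just as before. A sketch, with made-up starting values and control = list(fnscale = -1) so that optim maximizes rather than minimizes:

    THETA.init <- c(1, 0, 1, 0, 1)   # sigma.e, mu.a, sigma.a, mu.b, sigma.b (placeholder values)
    optimRes <- optim(THETA.init, panelLogLik.faster,
                      dataList = dataList, seedMatrix = seedMatrix,
                      control = list(fnscale = -1))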
  • 45. Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 46. The list-based outer loop - multicore Use the R multicore library, and replace lapply with mclapply at the outer loop.

    library(multicore)
    ...
    firmLik <- mclapply(dataList, firmLikelihood, THETA, R)

This will lead to some substantial speedups.
• 47. multicore
    N: 200, R: 150, T: 80, logLike: -34951.8 (on a 4-core laptop)
      > proc.time()
         user  system elapsed
      389.180  36.960 125.674
    N: 1000, R: 320, T: 80, logLike: -174621.9 (on an EC2 2XL)
      > proc.time()
         user  system elapsed
      2705.77 2686.08  417.74
    N: 5000, R: 710, T: 80, logLike: -870744.4
      > proc.time()
           user    system   elapsed
      16206.480 16067.150  2768.588
multicore can provide quick and easy parallelization. Write the program so that the parallel part is an operation on a list, then replace lapply with mclapply.
  • 48. Bad . . . . . .
  • 49. Good . . . . . .
  • 50. multicore is nice for optimizing a local job. Most machines today have at least 2 cores. Many have 4 or 8. However, that is still only 1 machine. Let’s use n of them → . . . . . .
  • 51. Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 52. installing segue Install the prerequisite packages rJava and caTools. On Ubuntu Linux:

    sudo apt-get install r-cran-rjava r-cran-catools

Then, download and install segue from http://code.google.com/p/segue/
• 53. Using segue Now in R we do:

    > library(segue)

As we will be using our AWS account, we are going to need to set credentials so that other people can't launch clusters in our name. To get our credentials, go to http://aws.amazon.com/account/ and click “Security Credentials”. Go back into R:

    setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")
• 54. Firing up the cluster In segue, use the createCluster command:

    createCluster(numInstances = 2, cranPackages, filesOnNodes, rObjectsOnNodes,
                  enableDebugging = FALSE, instancesPerNode,
                  masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
                  location = "us-east-1a", ec2KeyName, copy.image = FALSE,
                  otherBootstrapActions, sourcePackagesToInstall)

In our case, let's fire up 10 m2.4xlarge. This gives us 80 cores and 684 GB of RAM to play with.
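A call along those lines (the variable name myCluster and the decision to leave the other arguments at their defaults are my own choices, not dictated by segue) might be:

    myCluster <- createCluster(numInstances       = 10,
                               masterInstanceType = "m2.4xlarge",
                               slaveInstanceType  = "m2.4xlarge")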
• 55. parallel random number generation

    > myList <- NULL
    > set.seed(1)
    > for (i in 1:10) { a <- c(rnorm(999), NA); myList[[i]] <- a }
    > outputLocal <- lapply(myList, mean, na.rm = T)
    > outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
    > all.equal(outputEmr, outputLocal)
    [1] TRUE

segue handles this for you. This is very important for simulation.
• 56. Monte Carlo π estimation

    estimatePi <- function(seed) {
      set.seed(seed)
      numDraws <- 1e6
      r <- .5  # radius... in case the unit circle is too boring
      x <- runif(numDraws, min = -r, max = r)
      y <- runif(numDraws, min = -r, max = r)
      inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
      return(sum(inCircle) / length(inCircle) * 4)
    }

    seedList <- as.list(1:100)
    require(segue)
    myEstimates <- emrlapply(myCluster, seedList, estimatePi)
    myPi <- Reduce(sum, myEstimates) / length(myEstimates)

    > format(myPi, digits = 10)
    [1] "3.14166556"
  • 57. parallel MLE Using code from sml.segue.R on my website. It is exactly the same as the multicore example, but with the addition of 2 lines to start the cluster. . . . . . .
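Concretely (a sketch, not the contents of sml.segue.R, and reusing the cluster object created above), the firm-level evaluation changes from mclapply to emrlapply:

    # before (multicore):  firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
    # after  (segue/EMR):
    firmLik <- emrlapply(myCluster, dataList, firmLikelihood, THETA, R)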
  • 58. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
• 59. EC2 has GPUs Cluster GPU Quadruple Extra Large Instance 22 GB of memory 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem architecture) 2 x NVIDIA Tesla Fermi M2050 GPUs 1690 GB of instance storage 64-bit platform I/O Performance: Very High (10 Gigabit Ethernet) API name: cg1.4xlarge The Fermi chips matter because they have ECC memory, so long-running simulations are not corrupted by memory errors. They are much more robust than gamer GPUs and cost about $2800 per card; each machine has 2. You can use them for $2.10 per hour.
  • 60. RHIPE RHIPE = R and Hadoop Integrated Processing Environment http://www.stat.purdue.edu/~sguha/rhipe/ Implements rhlapply function Exposes much more of Hadoop’s underlying functionality, including the HDFS ⇒ May be better for large data applications . . . . . .
• 61. StarCluster I Allows instantiation of generic clusters on EC2 Use MPI (Message Passing Interface) for much more complicated parallel programs, e.g. holding one giant matrix across the RAM of several nodes From their page: Simple configuration with sensible defaults Single ”start” command to automatically launch and configure one or more clusters on EC2 Support for attaching and NFS-sharing Amazon Elastic Block Storage (EBS) volumes for persistent storage across a cluster Comes with a publicly available Amazon Machine Image (AMI) configured for scientific computing AMI includes OpenMPI, ATLAS, Lapack, NumPy, SciPy, and other useful libraries
  • 62. StarCluster II Clusters are automatically configured with NFS, Sun Grid Engine queuing system, and password-less ssh between machines Supports user-contributed ”plugins” that allow users to perform additional setup routines on the cluster after StarCluster’s defaults http://web.mit.edu/stardev/cluster/ . . . . . .
• 63. Matlab You can do it in theory, but you need either a license manager or the Matlab compiler It will cost you. Whitepaper from Mathworks: http://www.mathworks.com/programs/techkits/ec2_paper.html May be able to coax EMR to run a compiled Matlab script, but you would have to bootstrap each machine to have the libraries required to run compiled Matlab applications Mathworks has no incentive to support this behaviour Requires toolboxes ($$$).
  • 64. Table of Contents Tools Overview Hadoop Amazon Web Services A Simple EMR and R Example The R code - mapper Resources List segue and a SML Example Simulated Maximum Likelihood Example multicore - on the way to segue diving into segue Other EC2 Software Options Conclusion . . . . . .
  • 65. EC2 and Hadoop are Extremely Powerful Huge and active community behind both Hadoop (Apache) and EC2 (Amazon). EC2 and AWS in general allow you to change the way you think about computing resources, as a service rather than as devices to manage. New AWS features are always being added . . . . . .
• 66. AWS in Education AMAZON WILL GIVE YOU MONEY Researcher - send them your proposal, they send you credits, you thank them in the paper. Teacher - if you are teaching a class, each student gets a $100 credit, good for one year. This would be great for teaching econometrics, where you can provide a machine image with software and data already available. Additionally, use AWS for your backups (S3) and other tech needs.
• 67. Resources My website http://www.econsteve.com/r for the code in this presentation AWS Management Console http://aws.amazon.com/console/ AWS Blog http://aws.typepad.com AWS in Education http://aws.amazon.com/education/