6. Table of Contents
Tools Overview
Hadoop
Amazon Web Services
A Simple EMR and R Example
The R code - mapper
Resources List
segue and a SML Example
Simulated Maximum Likelihood Example
multicore - on the way to segue
diving into segue
Other EC2 Software Options
Conclusion
8. Algorithms and Implementations
“Stupidly parallel” - e.g. a for loop where each iteration is independent.
Only 1 computer? (with 1-8 cores) - use the R multicore package on a single EC2 node.
Need more? Use Hadoop / MapReduce - handles complicated mapping and aggregation, in addition to the stupidly parallel stuff.
MapReduce - use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
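The “stupidly parallel” shape is any loop whose iterations do not depend on each other; rewriting it as an apply over a list is the step that makes every tool in this talk (multicore, segue, Hadoop Streaming) applicable. A minimal illustrative sketch:

```r
# A for loop with independent iterations ...
results <- numeric(10)
for (i in 1:10) results[i] <- sqrt(i)

# ... is equivalent to lapply over a list, which parallel tools
# (mclapply, emrlapply) can then distribute with no further changes.
results2 <- unlist(lapply(1:10, sqrt))
all.equal(results, results2)  # TRUE
```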
9. In this presentation, we will be using Hadoop either directly through Elastic MapReduce or indirectly via the Segue package for R.
10. Alternatives
Wait a long time
Use multiple cores, e.g.
http://www.rforge.net/doc/packages/multicore/mclapply.html
Take over the computer lab and start jobs by hand
Buy your own cluster (huge upfront cost, and it will sit idle most of the time)
12. What is it?
Hadoop is made by the Apache Software Foundation, which produces open source software. Contributors to the foundation include both large companies and individuals.
Hadoop Common: the common utilities that support the other Hadoop subprojects.
HDFS: a distributed file system that provides high-throughput access to application data.
MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Often, when people say “Hadoop” they mean Hadoop’s implementation of the MapReduce algorithm.
The algorithm was created by Google and is documented here: http://labs.google.com/papers/mapreduce.html
13. What is it for?
Used to process many TB of webserver logs for metrics, targeted ad placement, etc.
Users include:
Google - calculating PageRank, processing traffic, etc.
Yahoo - more than 100,000 CPUs across various clusters, including a 4,000-node cluster. Used for ad placement, etc.
LinkedIn - huge social network graphs - “people you may know...”
Amazon - building product search indices
See: http://wiki.apache.org/hadoop/PoweredBy
15. Algorithm
The idea is that a job is broken into map and reduce steps:
The mapper processes input and creates chunks
The reducer aggregates the chunks
Hadoop provides a Java implementation of this algorithm. Features include fault tolerance, adding nodes on the fly, extreme speed, and more.
Hadoop itself is implemented in Java, but Hadoop Streaming allows mappers and reducers in any language, communicating over stdin and stdout.
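A streaming mapper only has to honor the key-TAB-value contract on its standard output; Hadoop then sorts by key and feeds each group to the reducer. A minimal sketch of the per-line work for a word count (the full stdin reading loop appears in the mapper outline on a later slide):

```r
# Hadoop Streaming contract: the mapper writes "key<TAB>value" lines
# to stdout. Sketch of the per-line work for a word count:
map_line <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  # one "word<TAB>1" pair per token
  paste(words[nchar(words) > 0], 1L, sep = "\t")
}

map_line("to be or not to be")
# emits the pairs to/1, be/1, or/1, not/1, to/1, be/1
```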
18. What is this cloud?
Cloud computing is the idea of abstracting away from
hardware
All data and computing resources are managed services
Pay per hour, based on need
19. AWS Overview
Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:
EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Instances range from micro ($0.02/hr) to 8-core, 70 GB RAM “quad-XL” ($2.00/hr) to GPU machines ($2.10/hr).
EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
S3 - Simple Storage Service - store very large objects in the cloud.
RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the RMySQL package, e.g.
select date, price from myTable where TICKER='AMZN'
22. Steps
1. Write the mapper in R. The output will be aggregated by Hadoop’s aggregate function.
2. Create the input files
3. Upload everything to S3
4. Configure the EMR job in the AWS Management Console
5. Done!
23. Files
The directory emr.simpleExample/simpleSimRmapper contains the following:
makeData.R generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
fileSplit.sh takes a directory of input files and prepares them for use with EMR (more on this later).
sjb.simpleMapper.R takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
25. Mapper functions
INPUT: stdin. This can be:
A seed for a random number generator
Raw text data to process
A list of file names to process - we are doing this one.
OUTPUT: stdout (print it!), which then goes to the reducer.
26. General R Mapper Code Outline
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)

  # process and print results
}
close(con)
27. Simple Mapper
File: sjb.simpleMapper.R
Algorithm:
get the file from S3
read it
run the regression
print results in a way that aggregate can read
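The aggregate reducer sums any value whose output line carries a type prefix such as DoubleValueSum:. The regression-and-print step might look like the sketch below; the S3 fetch is elided, and process_file plus the fabricated data frame are illustrative stand-ins, not the actual contents of sjb.simpleMapper.R:

```r
# Sketch of the per-file work: run a regression, then print each
# coefficient with the "DoubleValueSum:" prefix, which tells Hadoop's
# aggregate reducer to sum the values for each key.
process_file <- function(df) {
  fit <- lm(y ~ x, data = df)
  for (nm in names(coef(fit))) {
    cat("DoubleValueSum:", nm, "\t", coef(fit)[[nm]], "\n", sep = "")
  }
}

# Fabricated data in place of the S3 download (illustrative only)
set.seed(1)
fake <- data.frame(x = 1:100)
fake$y <- 2 + 3 * fake$x + rnorm(100)
process_file(fake)   # prints one line per coefficient
```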
29. Overview
1. Made some data with makeData.R
2. Used fileSplit.sh to make lists of files to grab from S3. These lists will be fed into the mapper. Then transferred the data and lists to S3. See moveToS3.sh for the list of commands, but don’t try to run it directly.
3. sjb.simpleMapper.R reads lines. Each line names a file. It opens the file, does some work, and prints some output.
4. Configured the job on EMR using the AWS Management Console, using the standard aggregator to aggregate results.
30. Numbers
Consider this: in less than 10 minutes, we
Instantiated a cluster of 13 m2.xlarge nodes (68.4 GB RAM, 8 cores each)
Installed the Linux OS and Hadoop software on all nodes
Distributed approx. 20 GB of data to the nodes
Ran some analysis in R
Aggregated the results
Shut down the cluster
32. Useful Links
Good EMR R Discussion
Hadoop on EMR with C# and F#
Hadoop Aggregate
34. Description
From the project website:
Segue has a simple goal: Parallel functionality in R; two
lines of code; in under 15 minutes.
J.D. Long
From the segue homepage: http://code.google.com/p/segue/
35. The AWS API - what segue is built on
API stands for Application Programming Interface
All Amazon Web Services have APIs, which allow programmatic access. This exposes many more features than the AWS Management Console.
For example, through the API one can start and stop a cluster without adding jobs, add nodes to a running cluster, etc.
Using the API, you can write programs that treat clusters as native objects
segue is such a program
36. segue usage
Segue is ideal for CPU-bound applications - e.g. simulations
It replaces lapply, which applies a function to the elements of a list, with emrlapply, which distributes the evaluation of the function to a cluster via Elastic MapReduce
The list can be anything - seeds for a random number generator, matrices to invert, data frames to analyse, etc.
38. Code overview
Note: code available on my website, http://econsteve.com/r.
Showing 3 levels of optimization:
For loops to matrices
Evaluating firms on multiple cores
Evaluating firms on multiple computers on EC2
39. Simulated MLE
We use the simulator
\[
\ln \hat{L}^R_N = \sum_{i=1}^{N} \ln\left[ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_i} h\left(y_{it} \mid x_{it}, \theta, u^r_i\right) \right]
\]
where i ∈ {1, ..., N} is a person among people, or a firm in a set of firms, R is the number of simulations to do, where R ∝ √N, and T_i is the length of the data for firm i.
40. With for loops - R pseudocode
panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
  logLik <- 0
  uir <- qnorm(seedMatrix)
  for (n in 1:N) {
    LiR <- 0
    for (r in 1:R) {
      myProduct <- 1
      alpha.r <- mu.a + uir[r, (2 * n) - 1] * sigma.a
      beta.r  <- mu.b + uir[r, (2 * n)] * sigma.b
      for (t in 1:T) {
        # fi = residual density using Y, THETA
        myProduct <- myProduct * fi
      }
      LiR <- LiR + myProduct
    } # end for r in R
    Li <- LiR / R
    logLik <- logLik + log(Li)
  } # end for n
  return(logLik)
}
41. With for loops - R pseudocode
We then maximize the likelihood function:
optimRes <- optim(THETA.init1, panelLogLik.simple, ...)
This is extremely slow on one processor, and does not lend itself to parallelization. (30 min for 60 firms - didn’t bother to test more.)
42. Opt 1 - matrices, lists, lapply
We adopt a new approach with the following rules:
Structure the data as a list of lists, where each sublist contains the data, the ticker symbol, and the uir draws for the relevant coefficients
Make a firm-level (i ∈ N) likelihood function, and an outer panel likelihood function that sums the results across firms
43. Opt 1 - matrices, lists, lapply - firm likelihood
# this should be an extremely fast firm likelihood function
firmLikelihood <- function(dataListItem, THETA, R) {
  sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
  mu.b <- THETA[4]; sigma.b <- THETA[5]
  data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
  T <- nrow(data.n)
  uirAlpha <- dataListItem$UIRALPHA
  uirBeta  <- dataListItem$UIRBETA
  alpha.rmat <- mu.a + uirAlpha * sigma.a
  beta.rmat  <- mu.b + uirBeta * sigma.b
  YtStack <- repmat(Y.n, R, 1)
  XtStack <- repmat(X.n, R, 1)
  residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
  fitMat <- (1 / (sigma.e * sqrt(2 * pi))) * exp(-(residMat^2) / (2 * sigma.e^2))
  myProductVec <- apply(fitMat, 1, prod)
  Li2 <- sum(myProductVec) / R
  return(Li2)
}
44. The list-based outer loop
panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
  # the seed matrix has R rows and 2*N columns, where there are
  # N firms and 2 parameters of interest (alpha and beta)
  uir <- qnorm(seedMatrix)
  R <- nrow(seedMatrix)
  # notice that we can calculate the likelihoods independently for
  # each firm, so we can make a function and use lapply. This will be
  # useful for parallelization
  firmLik <- lapply(dataList, firmLikelihood, THETA, R)
  logLik <- sum(log(unlist(firmLik)))
  return(logLik)
}
46. The list-based outer loop - multicore
Use the R multicore library, and replace lapply with mclapply at
the outer loop.
library(multicore)
...
firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
This will lead to some substantial speedups.
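As a self-contained illustration of the one-line change (note: the multicore package has since been absorbed into base R's parallel package, which provides the same mclapply; on Windows mc.cores must stay at 1):

```r
library(parallel)  # modern home of mclapply (successor to multicore)

# A deliberately slow per-element function
slow_square <- function(x) { Sys.sleep(0.01); x^2 }

serial_res <- lapply(1:8, slow_square)
par_res    <- mclapply(1:8, slow_square, mc.cores = 2)  # forks 2 workers

identical(serial_res, par_res)  # TRUE: same results, computed concurrently
```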
47. multicore
N: 200 R: 150 T: 80 logLik: -34951.8. On a 4-core laptop:
> proc.time()
    user   system  elapsed
 389.180   36.960  125.674
N: 1000 R: 320 T: 80 logLik: -174621.9. On an EC2 2XL:
> proc.time()
    user   system  elapsed
 2705.77  2686.08   417.74
N: 5000 R: 710 T: 80 logLik: -870744.4
> proc.time()
      user    system   elapsed
 16206.480 16067.150  2768.588
multicore can provide quick and easy parallelization. Write the program so that the parallel part is an operation on a list, then replace lapply with mclapply.
50. multicore is nice for optimizing a local job.
Most machines today have at least 2 cores. Many have 4 or 8.
However, that is still only 1 machine. Let’s use n of them →
52. Installing segue
Install the prerequisite packages rJava and caTools. On Ubuntu Linux:
sudo apt-get install r-cran-rjava r-cran-catools
Then download and install segue from
http://code.google.com/p/segue/
53. Using segue
Now in R we do:
> library(segue)
As we will be using our AWS account, we need to set credentials so that other people can’t launch clusters in our name. To get our credentials, go to http://aws.amazon.com/account/ and click “Security Credentials”.
Back in R:
setCredentials("ABC123",
               "REALLY+LONG+12312312+STRING+456456")
54. Firing up the cluster in segue
Use the createCluster command:
createCluster(numInstances = 2, cranPackages, filesOnNodes,
              rObjectsOnNodes, enableDebugging = FALSE, instancesPerNode,
              masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
              location = "us-east-1a", ec2KeyName, copy.image = FALSE,
              otherBootstrapActions, sourcePackagesToInstall)
In our case, let’s fire up 10 m2.4xlarge instances. This gives us 80 cores and 684 GB of RAM to play with.
55. Parallel random number generation
> myList <- NULL
> set.seed(1)
> for (i in 1:10) {
    a <- c(rnorm(999), NA)
    myList[[i]] <- a
  }
> outputLocal <- lapply(myList, mean, na.rm = TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = TRUE)
> all.equal(outputEmr, outputLocal)
[1] TRUE
segue handles this for you. This is very important for simulation.
56. Monte Carlo π estimation
estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- .5  # radius... in case the unit circle is too boring
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}
seedList <- as.list(1:100)
require(segue)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits = 10)
[1] "3.14166556"
57. parallel MLE
Using code from sml.segue.R on my website. It is exactly the
same as the multicore example, but with the addition of 2 lines to
start the cluster.
59. EC2 has GPUs
Cluster GPU Quadruple Extra Large Instance:
22 GB of memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem architecture)
2 x NVIDIA Tesla Fermi M2050 GPUs, 1690 GB of instance storage, 64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge
The Fermi chips matter because they have ECC memory, so simulations are accurate. These are much more robust than gamer GPUs and cost $2800 per card; each machine has 2. You can use them for $2.10 per hour.
60. RHIPE
RHIPE = R and Hadoop Integrated Processing Environment
http://www.stat.purdue.edu/~sguha/rhipe/
Implements the rhlapply function
Exposes much more of Hadoop’s underlying functionality, including the HDFS
May be better for large-data applications
61. StarCluster I
Allows instantiation of generic clusters on EC2
Use MPI (Message Passing Interface) for much more complicated parallel programs, e.g. holding one giant matrix across the RAM of several nodes
From their page:
Simple configuration with sensible defaults
Single “start” command to automatically launch and configure one or more clusters on EC2
Support for attaching and NFS-sharing Amazon Elastic Block Storage (EBS) volumes for persistent storage across a cluster
Comes with a publicly available Amazon Machine Image (AMI) configured for scientific computing
AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and other useful libraries
62. StarCluster II
Clusters are automatically configured with NFS, Sun Grid
Engine queuing system, and password-less ssh between
machines
Supports user-contributed “plugins” that allow users to perform additional setup routines on the cluster after StarCluster’s defaults
http://web.mit.edu/stardev/cluster/
63. Matlab
You can do it in theory, but you need either a license manager or the Matlab compiler
It will cost you.
Whitepaper from Mathworks: http://www.mathworks.com/programs/techkits/ec2_paper.html
You may be able to coax EMR into running a compiled Matlab script, but you would have to bootstrap each machine with the libraries required to run compiled Matlab applications
Mathworks has no incentive to support this behaviour
Requires toolboxes ($$$).
65. EC2 and Hadoop are Extremely Powerful
Huge and active community behind both Hadoop (Apache)
and EC2 (Amazon).
EC2 and AWS in general allow you to change the way you
think about computing resources, as a service rather than as
devices to manage.
New AWS features are always being added
66. AWS in Education
AMAZON WILL GIVE YOU MONEY
Researcher - send them your proposal, they send you credits, and you thank them in the paper.
Teacher - if you are teaching a class, each student gets a $100 credit, good for one year. This would be great for teaching econometrics, where you can provide a machine image with software and data already installed.
Additionally, AWS can cover your backups (S3) and other tech needs
67. Resources
My website http://www.econsteve.com/r for the code in
this presentation
AWS Management Console http://aws.amazon.com/console/
AWS Blog http://aws.typepad.com
AWS in Education http://aws.amazon.com/education/