Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
2. Model Building with RevoScaleR
Agenda:
The three realms of data
What is RevoScaleR?
RevoScaleR working beside Hadoop
RevoScaleR running within Hadoop
Run some code
3. The 3 Realms of Data
Bridging the gaps between architectures
4. The 3 Realms of Data
[Figure: number of rows (marks at 10^6, 10^11 and >10^12) plotted against architectural complexity. Labels: data in memory; data in a file; multiple files; the realm of “chunking”; the realm of massive data.]
6. RevoScaleR
An R package that ships exclusively with Revolution R Enterprise
Implements Parallel External Memory Algorithms (PEMAs)
Provides functions to:
– Import, clean, explore and transform data
– Perform statistical analysis and predictive analytics
– Enable distributed computing
Scales from small local data to huge distributed data
The same code works on small and big data, and on workstation, server, cluster, Hadoop
[Diagram: Revolution R Enterprise components: DeployR, ConnectR, RevoScaleR, DistributedR]
7. Parallel External Memory Algorithms (PEMAs)
Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
Process data a chunk at a time in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
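The four steps above can be illustrated with a minimal base R sketch. This is not RevoScaleR code: the function names (pema_init, pema_process_chunk, pema_aggregate, pema_finalize) are hypothetical, and the "chunks" are simply slices of an in-memory vector standing in for blocks read from disk. It computes a mean and variance from small per-chunk summaries.

```r
# Hypothetical helper names for illustration; not part of RevoScaleR.

# 1. Initialize: an empty set of sufficient statistics.
pema_init <- function() list(n = 0, sum = 0, sumsq = 0)

# 2. Process Chunk: reduce one chunk of data to a small summary object.
pema_process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3. Aggregate: combine two summary objects into one.
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4. Finalize: turn the combined summary into the reported statistics.
pema_finalize <- function(s) {
  m <- s$sum / s$n
  list(mean = m, var = (s$sumsq - s$n * m^2) / (s$n - 1))
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))  # ten chunks of 1000 values

# Each chunk is processed independently, so this lapply() could be
# replaced by a parallel map across cores or nodes.
partials <- lapply(chunks, pema_process_chunk)
total    <- Reduce(pema_aggregate, partials, pema_init())
result   <- pema_finalize(total)

print(result)
```

Only the small summary objects travel between steps, which is why the chunks can be processed in any order or in parallel. The sum-of-squares formula is used here for brevity; it is numerically fragile when the mean is large relative to the spread, so a production implementation would likely use a more stable update.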
8. RevoScaleR PEMAs
Statistical Modeling
Predictive Models
– Covariance, Correlation, Sum of Squares
– Multiple Linear Regression
– Generalized Linear Models: all exponential family distributions, Tweedie distribution; standard link functions; user-defined distributions & link functions
– Classification & Regression Trees
– Decision Forests
– Predictions/scoring for models
– Residuals for all models
Data Visualization
– Histogram
– Line Plot
– Lorenz Curve
– ROC Curves
Variable Selection
– Stepwise Regression
– PCA
Machine Learning
Cluster Analysis
– K-Means
Classification
– Decision Trees
– Decision Forests
Simulation
– Parallel random number generators for Monte Carlo
9. GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
10. PEMAs: Optimized for Performance
Arbitrarily large number of rows in a fixed amount of memory
Scales linearly
– with the number of rows
– with the number of nodes
Scales well
– with the number of cores per node
– with the number of parameters
Efficient
– Computational algorithms
– Memory management: minimize copying
– File format: fast access by row and column
Heavy use of C++
Models pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
Handle categorical variables efficiently
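To see why a fixed amount of memory can handle an arbitrary number of rows, consider linear regression. The base R sketch below is an illustration only, not the RevoScaleR implementation: it accumulates the cross-products X'X and X'y one chunk at a time, so the accumulators stay p x p and p x 1 no matter how many rows are processed. The data are simulated and the chunks are slices of an in-memory data frame standing in for blocks read from a file.

```r
set.seed(42)
n   <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n, sd = 0.1)

chunk_size <- 500
starts     <- seq(1, n, by = chunk_size)

# Accumulators: their size depends on the number of parameters (p),
# not on the number of rows.
p   <- 3
xtx <- matrix(0, p, p)
xty <- matrix(0, p, 1)

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, n)
  X    <- model.matrix(~ x1 + x2, data = dat[rows, ])  # design matrix for this chunk
  y    <- dat$y[rows]
  xtx  <- xtx + crossprod(X)     # adds t(X) %*% X
  xty  <- xty + crossprod(X, y)  # adds t(X) %*% y
}

# Solve the normal equations once all chunks have been seen.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit using all rows at once.
beta_lm <- coef(lm(y ~ x1 + x2, data = dat))

print(rbind(chunked = unname(beta_chunked), lm = unname(beta_lm)))
```

The work grows linearly with the number of rows because each row is touched once, while memory use is set by the chunk size and the number of parameters.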
11. Write Once. Deploy Anywhere.
Hadoop: Hortonworks, Cloudera
EDW: IBM, Teradata
Clustered Systems: Platform LSF, Microsoft HPC
Workstations & Servers: Desktop, Server, Linux
In the Cloud: Microsoft Azure Burst, Amazon AWS
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
13. Revolution R Enterprise Architecture
Use Hadoop for data storage and data preparation
Use RevoScaleR on a connected server for predictive modeling
Use Hadoop for model deployment
14. A Simple Goal: Hadoop As An R Engine.
Run Revolution R Enterprise code in Hadoop without change
Provide RevoScaleR pre-parallelized algorithms
Eliminate:
– The need to “think in MapReduce”
– Data movement
15. Revolution R Enterprise Architecture
Use RevoScaleR inside Hadoop for:
• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: Big-data predictive models with ScaleR
[Diagram: Hadoop cluster with an HDFS Name Node, a MapReduce Job Tracker, and five Data Nodes, each with a Task Tracker]
16. RRE in Hadoop
[Diagram: Hadoop cluster with an HDFS Name Node, a MapReduce Job Tracker, and five Data Nodes, each with a Task Tracker]
17. RRE in Hadoop
[Diagram: Hadoop cluster with an HDFS Name Node, a MapReduce Job Tracker, and five Data Nodes, each with a Task Tracker]
18. RevoScaleR on Hadoop
Each pass through the data is one MapReduce job
Prediction (Scoring), Transformation, Simulation:
– Map tasks store results in HDFS or return to client
Statistics, Model Building, Visualization:
– Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
– Master process decides if another pass through the data is required
Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
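The pattern described above (map tasks emit intermediate result objects, a reduce step aggregates them, and a master decides whether another pass is needed) can be sketched in base R. This is a local simulation under stated assumptions, not how RevoScaleR or Hadoop is actually invoked: map_chunk and reduce_pair are hypothetical names, lapply() stands in for the map tasks, and the model is a logistic regression fitted by Newton's method on simulated data.

```r
set.seed(7)
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Eight data blocks, standing in for blocks stored on separate nodes.
chunks <- split(data.frame(x = x, y = y), rep(1:8, length.out = n))

# "Map": turn one chunk into a small intermediate result object
# (this chunk's contribution to the gradient and Hessian).
map_chunk <- function(chunk, beta) {
  X  <- cbind(1, chunk$x)
  mu <- plogis(drop(X %*% beta))
  list(grad = crossprod(X, chunk$y - mu),
       hess = crossprod(X, X * (mu * (1 - mu))))
}

# "Reduce": add two intermediate result objects together.
reduce_pair <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

# "Master": one pass over all chunks per iteration, then decide
# whether another pass through the data is required.
beta   <- c(0, 0)
passes <- 0
repeat {
  passes <- passes + 1
  agg    <- Reduce(reduce_pair, lapply(chunks, map_chunk, beta = beta))
  step   <- drop(solve(agg$hess, agg$grad))
  beta   <- beta + step
  if (max(abs(step)) < 1e-8 || passes >= 25) break
}

# Reference fit on the full data set.
beta_glm <- unname(coef(glm(y ~ x, family = binomial)))

print(list(passes = passes, chunked = beta, glm = beta_glm))
```

Each iteration of the loop corresponds to one pass through the data, which is why keeping the data cached or in a fast binary format matters most for iterative algorithms like this one.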
21. Sample code: logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
        DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
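For readers without RevoScaleR, a rough open-source analogue of this model can be fitted with base R's glm(). The sketch below is an illustration, not the presenter's code: the data are simulated, the column values are made up, only a subset of the predictors is used, and it assumes that F() in the RevoScaleR formula bins a numeric variable into one factor level per integer value, which is approximated here with factor(floor(...)).

```r
set.seed(2013)
n <- 2000

# Simulated stand-in for the airline data; values are invented.
flights <- data.frame(
  Origin     = factor(sample(c("JFK", "LGA", "EWR"), n, replace = TRUE)),
  DayOfWeek  = factor(sample(1:7, n, replace = TRUE)),
  CRSDepTime = runif(n, min = 6, max = 22)  # scheduled departure, decimal hours
)

# Simulated arrival delay in minutes: later departures tend to be later.
flights$ArrDelay <- rnorm(n, mean = 2 * (flights$CRSDepTime - 12), sd = 20)

# Logistic regression on "delayed more than 15 minutes".
# factor(floor(CRSDepTime)) is an assumed approximation of F(CRSDepTime).
fit <- glm(as.integer(ArrDelay > 15) ~ Origin + DayOfWeek +
             factor(floor(CRSDepTime)),
           family = binomial, data = flights)

print(round(head(coef(fit)), 3))
```

Unlike rxLogit(), glm() needs the whole data set in memory, which is the limitation the chunked algorithms in this deck are meant to remove.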
22. Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
        DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)