Revolution R Enterprise 6.1 includes two important advances in high performance predictive analytics with R: (1) big data decision trees, and (2) the ability to easily extract and perform predictive analytics on data stored in the Hadoop Distributed File System (HDFS).
Classification and regression trees are among the most frequently used algorithms for data analysis and data mining. The implementation provided in Revolution Analytics’ RevoScaleR package is parallelized, scalable, distributable, and designed with big data in mind.
Decision trees and all of the other high performance predictive analytics functions provided with RevoScaleR (such as linear and logistic regression, generalized linear models, and k-means clustering) can now also be used to analyze data stored in HDFS. After specifying the connection parameters for HDFS, some or all of the data can be explored and analyzed directly, or quickly and efficiently extracted into a native file system.
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data
1. Revolution Confidential
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data
Presented by: Sue Ranney, VP Product Development
2. In today’s webcast:
High Performance Analytics (HPA) with
Revolution R Enterprise
‘Big Data’ Decision Trees
Revolution’s HPA with Hadoop Data
Resources, Q&A
3. Revolution R Enterprise: What Gets Installed?
Latest stable version of Open-Source R
High performance math libraries
RevoScaleR package that adds:
High performance ‘big data’ capabilities to R
Access to a variety of ‘data sources’ (e.g., SAS, SPSS,
text files, ODBC)
Ability to compute in a variety of ‘compute contexts’
(e.g., Windows/Linux workstation/server, Microsoft
HPC Server cluster, Azure Burst, IBM Platform LSF
cluster)
High performance computing capabilities
Integrated Development Environment based on Visual
Studio technology (for Windows): the R Productivity
Environment (RPE)
Revolution R Enterprise 5.0 Webinar 3
4. High Performance Analytics (HPA) in RevoScaleR
High Performance Computing + Data
Full-featured, fast, and scalable analysis
functions
Same code works on small and big data, and a
variety of data sources
Same code works on a variety of compute
contexts - a laptop, server, cluster, or the cloud
Scales approximately linearly with the number
of observations – without increasing memory
requirements
5. RevoScaleR: HPA Algorithms
Descriptive statistics (rxSummary)
Tables and cubes (rxCube, rxCrossTabs)
Correlations/covariances (rxCovCor, rxCor,
rxCov, rxSSCP)
K means clustering (rxKmeans)
Linear regressions (rxLinMod)
Logistic regressions (rxLogit)
Generalized Linear Models (rxGlm)
Predictions (scoring) (rxPredict)
Decision Trees (rxDTree) NEW!
6. Decision Trees
Relatively easy-to-interpret models
Widely used in a variety of disciplines, for example:
Predicting which patient characteristics are associated with
a high risk of heart attack
Deciding whether or not to offer a loan to an individual
based on individual characteristics.
Predicting the rate of return of various investment
strategies
Retail target marketing
Can easily handle a factor response with more than two levels
Useful in identifying important interactions
7. Decision Tree Types
Classification tree: predict what ‘class’ or
‘group’ an observation belongs in
(dependent variable is a factor) for each
terminal node or leaf
Regression tree: predict average value of
dependent variable for each terminal node
or leaf
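The difference between the two tree types shows up in what a leaf reports. A minimal base R sketch, assuming a handful of made-up observations that have all landed in the same leaf (the data frame and the `spend` variable are hypothetical, not from the webinar):

```r
# Toy observations that ended up in the same leaf (values are made up).
leaf <- data.frame(
  response = factor(c("Phone", "Email", "Phone", "Mail", "Phone"),
                    levels = c("Phone", "Email", "Mail")),
  spend    = c(12, 30, 18, 25, 15)
)

# Classification tree: the leaf predicts the most frequent class
# and reports the class proportions alongside it.
class_counts <- table(leaf$response)
class_pred   <- names(class_counts)[which.max(class_counts)]
class_probs  <- as.numeric(prop.table(class_counts))

# Regression tree: the leaf predicts the mean of the dependent variable.
reg_pred <- mean(leaf$spend)

print(class_pred)
print(class_probs)
print(reg_pred)
```

Here the classification leaf predicts "Phone" with proportions 0.6, 0.2, 0.2, and the regression leaf predicts 20.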
8. Simple Example: Marketing Response
Data set containing the following information:
Response: Was response to a phone call, email, or
mailing?
Age
Income
Marital status
Attended college?
9. Simple Example: Specifying the model
treeOut <- rxDTree(response ~ age + income + college + marital,
                   data = rdata)
where rdata is the name of the data set
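For readers without RevoScaleR, roughly the same model specification can be tried on small data with the open-source rpart package. This is a different implementation, shown only as an analogue; it assumes rpart is installed (it is distributed with standard R builds), and the simulated data and the rule generating `response` are made up for illustration:

```r
library(rpart)  # open-source recursive partitioning; not RevoScaleR

set.seed(1)
n <- 500
rdata <- data.frame(
  age     = sample(18:80, n, replace = TRUE),
  income  = round(runif(n, 10, 100)),
  college = factor(sample(c("College", "No College"), n, replace = TRUE)),
  marital = factor(sample(c("Single", "Married"), n, replace = TRUE))
)

# Made-up rule so the tree has some structure to find.
rdata$response <- factor(
  ifelse(rdata$age >= 65, "Mail",
         ifelse(rdata$college == "College", "Email", "Phone"))
)

# Same formula as on the slide, fitted as a classification tree.
treeOut <- rpart(response ~ age + income + college + marital,
                 data = rdata, method = "class")
print(treeOut)
```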
10. Simple Example: Basic Output
Each line gives the split, the number of observations in
the node, the number that do not match the predicted y value,
the predicted y value, and the y probabilities; * marks a
terminal node
1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
  2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
    4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
      8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
      9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
    5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.9494480 0.02498547) *
  3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
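A quick arithmetic check in base R shows how the two counts on each line relate to the probabilities. The numbers are copied from the root and node 4 of the printout; the variable names are only for this illustration:

```r
# Values copied from the printed tree (root and node 4).
n_obs  <- c(root = 10000, node4 = 2518)
loss   <- c(root = 4069,  node4 = 330)
p_pred <- c(root = 0.59310000, node4 = 0.86894361)  # probability of the predicted class

# Observations NOT in the predicted class: n * (1 - p).
implied_loss <- round(n_obs * (1 - p_pred))
print(implied_loss)

# Child counts add up to the parent count.
children_ok <- c(
  root  = 5074 + 4926 == 10000,  # nodes 2 and 3
  node2 = 2518 + 2556 == 5074,   # nodes 4 and 5
  node4 = 2256 + 262  == 2518    # nodes 8 and 9
)
print(children_ok)
```

The implied counts are 4069 and 330, the same as the second number printed for each node, so that number counts the observations that fall outside the predicted class.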
11. Simple Example: Visual Representation
[Figure: tree diagram of the fitted model. The root splits into No College and College; below that the branches split on age (at about 40 and 65), marital status (Single, Married), and income (at about 30), with leaves labeled Phone, Email, or Mail.]
12. Scaling HPA with RevoScaleR
RevoScaleR functions can read from data sets on disk in
chunks, so you can increase the number of observations in
the data set beyond what can be analyzed in memory all at
once
RevoScaleR analysis functions process chunks of data in
parallel, taking greater advantage of your computing
resources (Parallel External Memory Algorithms)
Multiple cores on a desktop/server
Clusters/grids have the added advantage of more hard drives
for storing & accessing data
Windows HPC Server Cluster
“Burst” computations to Azure in the cloud
IBM Platform LSF Grid
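The chunking idea above can be sketched in base R. This is not RevoScaleR code: an in-memory vector stands in for a file read in blocks, and the chunk size of 1000 is an arbitrary assumption. Each chunk is reduced to a few running totals, and the totals are then combined, which is what lets chunks be handled by different cores or nodes.

```r
# Per-chunk summary: only three numbers are kept, whatever the chunk size.
chunk_stats <- function(chunk) {
  c(n = length(chunk), s = sum(chunk), ss = sum(chunk^2))
}

set.seed(123)
x <- rnorm(10500, mean = 50, sd = 10)             # stand-in for data on disk
chunks <- split(x, ceiling(seq_along(x) / 1000))  # 11 chunks, last one partial

# Each chunk could be summarized by a different worker; here lapply runs them in turn.
partials <- lapply(chunks, chunk_stats)

# Combining is just addition, so the order of chunks does not matter.
total <- Reduce(`+`, partials)

chunked_mean <- total[["s"]] / total[["n"]]
chunked_var  <- (total[["ss"]] - total[["n"]] * chunked_mean^2) / (total[["n"]] - 1)

print(c(mean = chunked_mean, var = chunked_var))
```

Memory use depends on the chunk size rather than on the total number of observations, which is the property the slide describes.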
13. The ‘Big Data’ Decision Tree Algorithm
Classical algorithms for building a decision tree
sort all continuous variables in order to decide
where to split the data.
This sorting step becomes time and memory
prohibitive when dealing with large data.
rxDTree bins the data rather than sorting,
computing histograms to create empirical
distribution functions of the data
rxDTree partitions the data horizontally, processing
in parallel different sets of observations
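A rough base R sketch of the binning idea, for a numeric response and a single numeric predictor. It is an illustration of the general technique only, not the rxDTree implementation: the equal-width bins, the function names, and the simulated data are all assumptions. Per-bin counts and sums are accumulated chunk by chunk, and candidate splits are then evaluated only at bin edges, so no sort of the full data is needed.

```r
# Per-bin count and sum of y for one chunk of data.
bin_stats <- function(x, y, breaks) {
  bin <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)
  nb  <- length(breaks) - 1
  cbind(n  = tabulate(bin, nbins = nb),
        sy = vapply(seq_len(nb), function(b) sum(y[which(bin == b)]), numeric(1)))
}

# Pick the bin edge that minimizes within-node squared error.
# Minimizing SSE is the same as maximizing sum_l^2/n_l + sum_r^2/n_r.
best_binned_split <- function(stats, breaks) {
  k    <- seq_len(nrow(stats) - 1)          # interior bin edges
  n_l  <- cumsum(stats[, "n"])[k]
  sy_l <- cumsum(stats[, "sy"])[k]
  n_r  <- sum(stats[, "n"]) - n_l
  sy_r <- sum(stats[, "sy"]) - sy_l
  score <- sy_l^2 / n_l + sy_r^2 / n_r
  score[n_l == 0 | n_r == 0] <- -Inf        # ignore splits with an empty side
  breaks[k[which.max(score)] + 1]
}

set.seed(42)
x <- runif(5000, 0, 100)
y <- ifelse(x < 40, 1, 5) + rnorm(5000, sd = 0.5)  # true change point at 40
breaks <- seq(0, 100, by = 5)                       # 20 equal-width bins (assumed)

# Horizontal partitioning: each chunk of rows yields its own histogram,
# and histograms are combined by addition.
row_chunks <- split(seq_along(x), ceiling(seq_along(x) / 1000))
stats <- Reduce(`+`, lapply(row_chunks, function(i) bin_stats(x[i], y[i], breaks)))

split_point <- best_binned_split(stats, breaks)
print(split_point)
```

With these simulated data the chosen split is the bin edge at 40. The trade-off is that a split can only fall on a bin edge, so fewer bins mean less work but coarser split points.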
14. Useful rxDTree Arguments for Big Data
cp: complexity parameter. Increasing cp will
decrease the number of splits attempted
maxDepth: the maximum depth of any tree
node. The computations take much longer at
greater depth, so lowering maxDepth can
greatly speed up computation time.
maxNumBins: the maximum number of bins
to use to cut numeric data. Decreasing
maxNumBins will speed up computation
time.
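The effect of the first two arguments can be seen on small data with the open-source rpart package, whose `cp` and `maxdepth` controls play a similar role. This is an analogue under stated assumptions, not rxDTree itself: the data are simulated, rpart is assumed to be installed, and rpart has no counterpart to `maxNumBins`.

```r
library(rpart)  # open-source analogue; not RevoScaleR

set.seed(7)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- 3 * (d$x1 > 0.5) + 2 * (d$x2 > 0.3) + d$x3 + rnorm(n, sd = 0.3)

# Fit a regression tree and return its number of nodes.
count_nodes <- function(cp, maxdepth) {
  fit <- rpart(y ~ x1 + x2 + x3, data = d, method = "anova",
               control = rpart.control(cp = cp, maxdepth = maxdepth))
  nrow(fit$frame)  # one row per node
}

nodes <- c(
  loose   = count_nodes(cp = 1e-4, maxdepth = 10),
  high_cp = count_nodes(cp = 0.05, maxdepth = 10),  # larger cp: fewer splits kept
  shallow = count_nodes(cp = 1e-4, maxdepth = 2)    # depth 2: at most 7 nodes
)
print(nodes)
```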
15. ‘Big Data’ Example
CDC Report in Jan. 2012
16. The U.S. Birth Data: 1985 - 2009
Public-use data sets containing information on
all births in the United States for each year from
1985 to 2009 are available to download:
http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
“These natality files are gigantic; they’re
approximately 3.1 GB uncompressed. That’s a
little larger than R can easily process” – Joseph
Adler, R in a Nutshell
I’ve imported key variables from each year into
a single .xdf file with over 100 million
observations.
17. Regression Tree: Multiple Births
Call:
rxDTree(formula = IsMultiple ~ DadAgeR8
+ MAGER + FRACEREC + FHISP_REC +
MRACEREC + MHISP_REC + DOB_YY,
data = birthAllC,
maxDepth = 6, cp = 1e-05,
blocksPerRead = 10, verbose = 1)
File:
C:RevolutionDataCDCBirthUS.xdf
Number of valid observations: 100672041
Number of missing observations: 0
18. Leaves with Lowest Percent of Multiple Births
Mom is not black and under the age of 20: 1.3%
Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age; the birth is before 1997: 1.6%
Mom is black and under the age of 18: 1.7%
19. Leaves with Highest Percent of Multiple Births
Mom is over 47 years old and the birth is after 1996: 38.6%
Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996: 28.1%
Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996: 15.5%
21. RevoScaleR with Hadoop Data Files (NEW)
The Hadoop Distributed File System (HDFS)
is highly fault-tolerant and
is designed to be deployed on low-cost
hardware.
RevoScaleR supports accessing data in the
HDFS file system for import or for direct
analysis
22. RevoScaleR Data Sources
Data Sources can be used for import or directly for
analysis
External: delimited text, fixed format text, SAS, SPSS,
ODBC connections
Provided with RevoScaleR: efficient .xdf file format
Data Sources contain information about their file
system
Delimited text and .xdf data sources can both be used
with the HDFS file system
Data sources are used as input to HPA functions
23. An Example Using Hadoop Data
Hadoop cluster in our office
Five nodes of commodity hardware
Red Hat Enterprise Linux (RHEL) operating system
Cloudera’s Hadoop (CDH3)
Also has IBM Platform LSF workload management
system installed (not required to use HDFS data)
My colleague, Dawn Kinsey, recorded a data
analysis session
22 comma delimited files stored in HDFS
Contain information on U.S. flight arrivals, 1997 – 2008
24. Steps in Analysis
Set up a ‘file system’ object and a ‘data source’
object
Explore the HDFS airline data for the year 2000
directly
Extract variables of interest from all the files into an
.xdf file in the native file system
Use R’s great plotting capabilities on summary
information
Perform a big logistic regression on an .xdf file
stored in HDFS
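The steps above might look roughly like the following in RevoScaleR. This is a hedged sketch, not the recorded session: it requires the proprietary RevoScaleR package and a reachable Hadoop cluster, and every argument (host, port, file paths, variable names, formulas) is a hypothetical placeholder supplied by the caller. The function is only defined here; nothing runs until it is called.

```r
# Sketch only: all arguments are hypothetical placeholders; the body
# requires RevoScaleR and a reachable HDFS name node to run.
analyze_hdfs_airline <- function(host, port, hdfs_csv_files, local_xdf,
                                 hdfs_xdf, vars_to_keep,
                                 summary_formula, logit_formula) {
  library(RevoScaleR)  # ships with Revolution R Enterprise

  # 1. Set up a file system object and a data source object.
  hdfs     <- RxHdfsFileSystem(hostName = host, port = port)
  one_file <- RxTextData(hdfs_csv_files[1], fileSystem = hdfs)

  # 2. Explore one file (for example, the year 2000 file) directly in HDFS.
  print(rxSummary(summary_formula, data = one_file))

  # 3. Extract variables of interest from all files into a native .xdf file,
  #    creating it on the first pass and appending rows afterwards.
  append_mode <- "none"
  for (f in hdfs_csv_files) {
    rxImport(inData = RxTextData(f, fileSystem = hdfs), outFile = local_xdf,
             varsToKeep = vars_to_keep, append = append_mode, overwrite = TRUE)
    append_mode <- "rows"
  }

  # 4. Summarize the extracted data; the result is small enough to pass
  #    to ordinary R plotting functions.
  local_summary <- rxSummary(summary_formula, data = local_xdf)

  # 5. Fit a logistic regression on an .xdf file stored in HDFS.
  fit <- rxLogit(logit_formula, data = RxXdfData(hdfs_xdf, fileSystem = hdfs))

  list(summary = local_summary, fit = fit)
}
```

Passing the formulas and variable names as arguments keeps the sketch free of guesses about the column names in the airline files.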
26. Thank You!
Download slides, replay from today’s webinar
http://bit.ly/QJfR4A
Learn more about Revolution R Enterprise
Overview: revolutionanalytics.com/products
New feature videos:
http://www.revolutionanalytics.com/products/new-features.php
Contact Revolution Analytics
http://bit.ly/hey-revo
November 29: Real-Time Big Data Analytics: from Deployment
to Production
David Smith, VP Marketing and Community, Revolution Analytics
www.revolutionanalytics.com/news-events/free-webinars
27. Revolution Confidential
The leading commercial provider of software and support for the
popular open source R statistics language.
www.revolutionanalytics.com
+1 (650) 646 9545
Twitter: @RevolutionR