New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Revolution Confidential

Presented by:
Sue Ranney
VP Product Development


In today’s webcast:

• High Performance Analytics (HPA) with Revolution R Enterprise
• ‘Big Data’ Decision Trees
• Revolution’s HPA with Hadoop Data
• Resources, Q&A

Revolution R Enterprise: What Gets Installed?

• Latest stable version of Open-Source R
• High performance math libraries
• RevoScaleR package that adds:
  ◦ High performance ‘big data’ capabilities to R
  ◦ Access to a variety of ‘data sources’ (e.g., SAS, SPSS, text files, ODBC)
  ◦ Ability to compute in a variety of ‘compute contexts’ (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
  ◦ High performance computing capabilities
• Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)

High Performance Analytics (HPA) in RevoScaleR

• High Performance Computing + Data
• Full-featured, fast, and scalable analysis functions
• Same code works on small and big data, and a variety of data sources
• Same code works on a variety of compute contexts: a laptop, server, cluster, or the cloud
• Scales approximately linearly with the number of observations, without increasing memory requirements

RevoScaleR: HPA Algorithms

• Descriptive statistics (rxSummary)
• Tables and cubes (rxCube, rxCrossTabs)
• Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
• K means clustering (rxKmeans)
• Linear regressions (rxLinMod)
• Logistic regressions (rxLogit)
• Generalized Linear Models (rxGlm)
• Predictions (scoring) (rxPredict)
• Decision Trees (rxDTree) NEW!


Decision Trees

• Relatively easy-to-interpret models
• Widely used in a variety of disciplines, for example:
  ◦ Predicting which patient characteristics are associated with a high risk of, say, heart attack
  ◦ Deciding whether or not to offer a loan to an individual based on individual characteristics
  ◦ Predicting the rate of return of various investment strategies
  ◦ Retail target marketing
• Can handle multi-factor response easily
• Useful in identifying important interactions


Decision Tree Types

• Classification tree: predicts which ‘class’ or ‘group’ an observation belongs to (the dependent variable is a factor), with one predicted class for each terminal node or leaf
• Regression tree: predicts the average value of the dependent variable for each terminal node or leaf
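To make the difference concrete, here is a minimal base R sketch of what each tree type reports at a leaf. The leaf assignments, channel labels, and spend values are made-up toy data, not from the webinar, and the leaves are assumed to be known already rather than grown by a tree algorithm.

```r
# Hypothetical toy data: each observation is already assigned to a leaf.
leaf    <- c("A", "A", "A", "B", "B", "B", "B")
channel <- factor(c("Phone", "Phone", "Email", "Email", "Email", "Mail", "Email"))
spend   <- c(10, 14, 12, 30, 34, 28, 32)

# Classification tree: each leaf predicts its most frequent class.
class_pred <- sapply(split(channel, leaf),
                     function(x) names(which.max(table(x))))

# Regression tree: each leaf predicts the mean of the dependent variable.
reg_pred <- sapply(split(spend, leaf), mean)

print(class_pred)
print(reg_pred)
```

A regression tree fitted to a 0/1 dependent variable therefore reports, for each leaf, the proportion of ones in that leaf.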


Simple Example: Marketing Response

Data set containing the following information:
• Response: Was response to a phone call, email, or mailing?
• Age
• Income
• Marital status
• Attended college?


Simple Example: Specifying the Model

treeOut <- rxDTree(response ~ age + income + college + marital,
                   data = rdata)
where rdata is the name of the data set


Simple Example: Basic Output

• Each line gives the split, the number of observations in the node, the number that do not match the predicted y value (the loss), the predicted y value, and the y probabilities (here in the order Phone, Email, Mail). An asterisk marks a terminal node.

1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
  2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
    4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
      8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
      9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
    5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
  3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
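As a quick arithmetic check on how to read these lines, the sketch below takes the two counts printed for the root and for nodes 2 and 4 and recovers the probability of the predicted class. It treats the second count as the number of observations that fall outside the predicted class, which is the reading the printed probabilities support. The variable names are illustrative only.

```r
# Counts copied from the root node and from nodes 2 and 4 of the printed output.
n    <- c(root = 10000, node2 = 5074, node4 = 2518)  # observations in the node
loss <- c(root = 4069,  node2 = 2378, node4 = 330)   # observations outside the predicted class

# Share of each node that falls in its predicted class.
share <- 1 - loss / n
print(round(share, 8))
```

The results agree with the printed probabilities for Email at the root (0.5931) and for Phone at nodes 2 and 4 (0.53133622 and 0.86894361).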


Simple Example: Visual Representation

[Figure: tree diagram of the fitted model. The root splits on college attendance (No College / College); lower splits use age (at 40 and 65), marital status (Single / Married), and income (at 30); the leaves are labeled Phone, Email, or Mail.]


Scaling HPA with RevoScaleR

• RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
• RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
  ◦ Multiple cores on a desktop/server
  ◦ Clusters/grids have the added advantage of more hard drives for storing and accessing data
    ▪ Windows HPC Server Cluster
    ▪ “Burst” computations to Azure in the cloud
    ▪ IBM Platform LSF Grid
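The chunking idea can be illustrated in base R without RevoScaleR. This is a rough sketch, not how RevoScaleR is implemented: it reads a CSV file a fixed number of rows at a time and keeps only running totals, so memory use depends on the chunk size rather than on the number of observations. The file, column names, and chunk size are made up for the example, and the chunks are processed one after another rather than in parallel.

```r
# Write a small hypothetical data set to disk so the example is self-contained.
set.seed(1)
path <- tempfile(fileext = ".csv")
full <- data.frame(age    = sample(18:80, 1000, replace = TRUE),
                   income = round(runif(1000, 10, 100), 1))
write.csv(full, path, row.names = FALSE)

chunk_size <- 250                       # rows held in memory at one time
con <- file(path, open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names)   # write.csv quotes the header names

n <- 0
total <- c(age = 0, income = 0)
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break         # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  n <- n + nrow(chunk)
  total <- total + colSums(chunk)       # update running sums, then drop the chunk
}
close(con)

chunk_means <- total / n
print(chunk_means)
```

Because the running sums from separate chunks simply add up, the same per-chunk work could be handed to different cores or nodes and combined at the end.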


The ‘Big Data’ Decision Tree Algorithm

• Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
• This sorting step becomes prohibitive in both time and memory when dealing with large data.
• rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data
• rxDTree partitions the data horizontally, processing different sets of observations in parallel
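The following base R sketch illustrates the general idea of choosing a split from binned data instead of sorted data. It is not the rxDTree implementation, whose internals are not described here: the data are simulated, the bin edges are fixed in advance, the split criterion is the usual sum-of-squares reduction for a regression tree, and the four chunks are processed in a plain loop rather than in parallel.

```r
set.seed(42)
n <- 10000
x <- runif(n, 15, 50)                       # hypothetical numeric predictor
y <- 2 * (x > 40) + rnorm(n, sd = 0.5)      # response jumps at x = 40

breaks <- seq(15, 50, by = 1)               # fixed bin edges: 35 bins
n_bins <- length(breaks) - 1

# Histogram for one chunk: per-bin observation count and per-bin sum of y.
chunk_hist <- function(xc, yc) {
  bin <- findInterval(xc, breaks, all.inside = TRUE)   # bin index in 1..n_bins
  bin <- factor(bin, levels = seq_len(n_bins))
  cbind(count = as.vector(table(bin)),
        sum_y = vapply(split(yc, bin), sum, numeric(1)))
}

# Horizontal partition: four sets of rows, one histogram per set.
chunk_id <- rep(1:4, length.out = n)
hists <- lapply(1:4, function(k) chunk_hist(x[chunk_id == k], y[chunk_id == k]))
combined <- Reduce(`+`, hists)              # histograms from chunks simply add

# Candidate split points are the interior bin edges; no sorting of x is needed.
thresholds <- breaks[2:n_bins]
left_n     <- cumsum(combined[, "count"])[-n_bins]
left_sum   <- cumsum(combined[, "sum_y"])[-n_bins]
right_n    <- sum(combined[, "count"]) - left_n
right_sum  <- sum(combined[, "sum_y"]) - left_sum

# Maximizing this score is equivalent to minimizing the within-node sum of squares.
score <- left_sum^2 / left_n + right_sum^2 / right_n
best_split <- thresholds[which.max(score)]
print(best_split)
```

Only the 35 bin totals travel between the chunks and the split search, so the cost of finding a split depends on the number of bins, not on the number of observations.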


Useful rxDTree Arguments for Big Data

• cp: complexity parameter. Increasing cp will decrease the number of splits attempted.
• maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
• maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.

‘Big Data’ Example

CDC Report in Jan. 2012

The U.S. Birth Data: 1985 - 2009

• Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
• “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell
• I’ve imported key variables from each year into a single .xdf file with over 100 million observations.


Regression Tree: Multiple Births
Call:
rxDTree(formula = IsMultiple ~ DadAgeR8
+ MAGER + FRACEREC + FHISP_REC +
MRACEREC + MHISP_REC + DOB_YY,
data = birthAllC,
maxDepth = 6, cp = 1e-05,
blocksPerRead = 10, verbose = 1)
File:
C:RevolutionDataCDCBirthUS.xdf
Number of valid observations: 100672041
Number of missing observations: 0


Leaves with Lowest Percent of Multiple Births

Percent  Leaf
1.3%     Mom is not black and under the age of 20
1.6%     Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997
1.7%     Mom is black and under the age of 18

Leaves with Highest Percent of Multiple Births

Percent  Leaf
38.6%    Mom is over 47 years old and the birth is after 1996
28.1%    Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996
15.5%    Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996


Poll Question
Are you using Hadoop?

RevoScaleR with Hadoop Data Files (NEW)

• The Hadoop Distributed File System (HDFS)
  ◦ is highly fault-tolerant and
  ◦ is designed to be deployed on low-cost hardware.
• RevoScaleR supports accessing data in HDFS for import or for direct analysis

RevoScaleR Data Sources

• Data Sources can be used for import or directly for analysis
  ◦ External: delimited text, fixed format text, SAS, SPSS, ODBC connections
  ◦ Provided with RevoScaleR: efficient .xdf file format
• Data Sources contain information about their file system
  ◦ Delimited text and .xdf data sources can both be used with HDFS
• Data sources are used as input to HPA functions

An Example Using Hadoop Data

• Hadoop cluster in our office
  ◦ Five nodes of commodity hardware
  ◦ Red Hat Enterprise Linux (RHEL) operating system
  ◦ Cloudera’s Hadoop (CDH3)
  ◦ Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
• My colleague, Dawn Kinsey, recorded a data analysis session
  ◦ 22 comma delimited files stored in HDFS
  ◦ Contain information on U.S. flight arrivals, 1997 – 2008


Steps in Analysis

• Set up a ‘file system’ object and a ‘data source’ object
• Explore the HDFS airline data for the year 2000 directly
• Extract variables of interest from all the files into an .xdf file in the native file system
• Use R’s great plotting capabilities on summary information
• Perform a big logistic regression on an .xdf file stored in HDFS



Poll Question
What features of Revolution R
Enterprise 6.1 are most interesting
to you?

Thank You!

• Download slides, replay from today’s webinar
  ◦ http://bit.ly/QJfR4A
• Learn more about Revolution R Enterprise
  ◦ Overview: revolutionanalytics.com/products
  ◦ New feature videos: http://www.revolutionanalytics.com/products/new-features.php
• Contact Revolution Analytics
  ◦ http://bit.ly/hey-revo

November 29: Real-Time Big Data Analytics: from Deployment to Production
David Smith, VP Marketing and Community, Revolution Analytics

www.revolutionanalytics.com/news-events/free-webinars


The leading commercial provider of software and support for the
popular open source R statistics language.

www.revolutionanalytics.com
+1 (650) 646 9545
Twitter: @RevolutionR

