This document introduces R as a tool for data mining. It describes R as a full programming language and the home of many data mining algorithms. The webinar aims to convince attendees that R is a serious platform for data mining. It covers getting started with R, popular machine learning functions and packages, and running example code. It also discusses working with big data using RevoScaleR and Revolution R Enterprise.
Introduction to R for Data Mining (Feb 2013)
1. Introduction to R for Data Mining
2013 Webinar Series
Joseph B. Rickert
February 14, 2013
Revolution Confidential
2. First Polling Question
What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other
3. My goal for today’s webinar is to convince you that:
Seriously, it is not difficult to learn enough R to do some serious data mining
R is a serious platform for data mining
Revolution R Enterprise is the platform for serious data mining
4. A word about Data Mining
We assume that you know a little bit about data mining and that this is your context for learning R.
5. Data Mining
Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
7. R is:
The way to do statistical computing
A full blown programming language
The home of nearly every data mining algorithm known to data science
A vibrant world-wide community
R was written in the early 1990s by Robert Gentleman and Ross Ihaka
Since 1997 a core group of ~20 developers guides the evolution of the language
8. R is organized into libraries of functions called packages
R Package Growth
4,332 packages as of 2/13/13
CRAN R download
Base
Recommended packages
User contributed packages
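As a rough illustration of the Base / Recommended / User contributed split, the following sketch lists the packages in a local R installation by priority. It relies only on `installed.packages()` from the `utils` package. The counts depend on the installation, so none are shown, and the variable names are chosen here for illustration.

```r
# List packages in the local library by priority.
# Counts vary by installation; the variable names are illustrative.
pkgs <- installed.packages()

base_pkgs        <- rownames(pkgs)[pkgs[, "Priority"] %in% "base"]
recommended_pkgs <- rownames(pkgs)[pkgs[, "Priority"] %in% "recommended"]
contributed_pkgs <- rownames(pkgs)[is.na(pkgs[, "Priority"])]

cat("Base packages:            ", length(base_pkgs), "\n")
cat("Recommended packages:     ", length(recommended_pkgs), "\n")
cat("User contributed packages:", length(contributed_pkgs), "\n")

# Installing and loading a user contributed package (needs network access):
# install.packages("randomForest")
# library(randomForest)
```

Packages with no priority are treated here as user contributed, which is an assumption that holds for a typical CRAN-based installation.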
9. Finding Your Way Around the world of R
Machine Learning
Data Mining
Visualization
Finding Packages
Task Views
crantastic.org
Blogs
Revolutions
R-Bloggers
Quick-R
Inside-R
Getting Help
Finding R People
User Groups worldwide
Twitter : #rstats
11. Learning R?
Levels of R Skill
Write production grade code: R developer
Write an R package: R contributor
Write code and algorithms: R programmer
Use R functions: R user
Use a GUI: R aware
Hours of use: 10 to 10,000 (the Malcolm Gladwell “Outlier” Scale)
12. Basic Machine Learning Functions
Category      Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
Cluster       kmeans         stats          Kmeans clustering
Classifiers   glm            stats          Logistic Regression
Classifiers   rpart          rpart          Recursive partitioning and regression trees
Classifiers   ksvm           kernlab        Support Vector Machine
Classifiers   apriori        arules         Rule based classification
Ensemble      ada            ada            Stochastic boosting
Ensemble      randomForest   randomForest   Random Forests classification and regression
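A minimal sketch of how a few of these functions are called, using only the `stats` package and the recommended `rpart` package on the built-in `iris` data. The data set, the choice of three clusters, and the variable names are assumptions made for illustration and are not taken from the webinar scripts.

```r
# Sketch: calling hclust, kmeans, glm and rpart on the built-in iris data.
library(rpart)  # recommended package, shipped with standard R distributions

set.seed(42)                      # kmeans uses random starting centers
features <- iris[, 1:4]           # the four numeric measurements

# Clustering: k-means with 3 centers, and hierarchical clustering cut into 3 groups
km <- kmeans(features, centers = 3, nstart = 10)
hc <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Classification: logistic regression on a two-class version of the problem
iris2 <- transform(iris, is_virginica = as.integer(Species == "virginica"))
logit <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
             data = iris2, family = "binomial")

# Classification: a recursive partitioning tree for all three species
tree <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(tree, iris, type = "class")

# Compare cluster assignments and tree predictions with the known species
print(table(cluster = km$cluster, species = iris$Species))
print(table(predicted = pred, actual = iris$Species))
```

The tree is scored on its own training data here, so the second table overstates how well it would do on new data.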
13. Noteworthy Data Mining Packages
Package   Comment
caret     Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle    A very intuitive GUI for data mining that produces useful R code
14. TIME TO RUN SOME CODE
Doing a lot with a little R
Script
1 GETTING STARTED.R
2 ROLL with RATTLE.R
3 IN THE TREES.R
4 INTRO to CARET.R
5 BIG DATA with RevoScaleR.R
6 WORDCLOUD.R
The R scripts are available at:
https://gist.github.com/joseph-rickert/4742529
15. Second Polling Question
What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other
16. Third Polling Question (insert after running script IN THE TREES)
What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other
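18. Too Big for Open Source R
# Pull the entire XDF file into an in-memory data frame
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
# Fit a logistic regression on the in-memory data with open source glm()
model <- glm(default ~ ., data = mortDF, family = "binomial")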
19. RevoScaleR brings the power of Big Data to R
Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform
Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database)
Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks
R Language Interface: Familiar, high-productivity programming paradigm for R users
20. RevoScaleR PEMAs: Parallel External Memory Algorithms
Read blocks and compute intermediate results in parallel, iterating as necessary
R based algorithms
Work on blocks of data
Inherently parallel and distributed
Do not require all data to be in memory at one time
Can deal with distributed and streaming data
[Diagram: XDF file divided into blocks (Block 1, Block 2, ... Block i, Block i+1, Block i+2), per-block results, results from last block, and 1st, 2nd, and 3rd passes]
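The block-wise idea can be sketched in base R. This is not RevoScaleR code and makes no claim about how RevoScaleR is implemented. It only illustrates, under simple assumptions, how a least squares fit can be assembled from per-block intermediate results (the sums of X'X and X'y) without holding all rows in memory at once. The built-in `mtcars` data stands in for a file on disk, and the block size and variable names are hypothetical.

```r
# Sketch of an external-memory style computation in base R (not RevoScaleR).
# Each block contributes X'X and X'y; the accumulated sums give the same
# least squares coefficients as a fit on the full data.
block_size <- 8                                  # hypothetical block size
n <- nrow(mtcars)
block_ids <- split(seq_len(n), ceiling(seq_len(n) / block_size))

xtx <- matrix(0, 3, 3)                           # running sum of X'X
xty <- matrix(0, 3, 1)                           # running sum of X'y

for (rows in block_ids) {
  block <- mtcars[rows, ]                        # in practice: read next block from disk
  X <- cbind(1, block$wt, block$hp)              # design matrix with intercept
  y <- block$mpg
  xtx <- xtx + crossprod(X)                      # intermediate result for this block
  xty <- xty + crossprod(X, y)
}

# Combine the intermediate results and compare with an in-memory fit
beta_blocks <- drop(solve(xtx, xty))
beta_lm <- unname(coef(lm(mpg ~ wt + hp, data = mtcars)))
print(all.equal(beta_blocks, beta_lm, tolerance = 1e-6))
```

Because each block's contribution is independent of the others, the loop body could in principle run on separate cores or machines, with only the small 3 x 3 and 3 x 1 sums sent back to be added together. Linear regression needs a single pass, while iterative methods such as logistic regression would repeat the pass until convergence.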
22. Continuing to Learn R
Resources
RevoJoe: How to Learn R
More R Documentation
The R Journal
Books
Reference Card and more
Classes
Coursera
Revolution Analytics
Examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)