Intro to R and Data Mining 2012 09 27

Raj Kasarabada
- Sept 28th 2012
(Some material borrowed from: UCLA Academic Technology Services
Technical Report Series and presentations found online & Harry Potter websites)
Introduction to R and Data Mining
Monday, September 01,
2014
1

What is R?
It is a programming language and a “lot like
magic...” except instead of spells you have
functions.
It is an open source software package.

R user = wizard
R users are like wizards. They can rely on functions (spells) that have been
developed for them by statistical researchers, but they can also create their
own.
They don’t have to pay to use them, and once experienced enough,
they are almost unlimited in their ability to change their environment.
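As a small illustration of casting an existing "spell" and writing your own, the sketch below calls a built-in function and then defines a new one. The name `std_error` and the sample values are made up for this example.

```r
# A built-in "spell": mean() ships with R
scores <- c(4, 8, 15, 16, 23, 42)
mean(scores)

# Your own "spell": the standard error of the mean.
# `std_error` is a hypothetical name chosen for this sketch.
std_error <- function(x) {
  sd(x) / sqrt(length(x))
}

std_error(scores)
```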

History of R
• S: language for data analysis developed at Bell Labs
circa 1976
• Licensed by AT&T/Lucent to Insightful Corp. Product
name: S-plus.
• R: initially written & released as open source
software by Ross Ihaka and Robert Gentleman at the University of
Auckland during the 1990s
• Since 1997: international R-core team ~15 people &
1000s of code writers and statisticians happy to share
their libraries! AWESOME!

So what is it?
• R is an interpreted computer language.
– Can interface with procedures written in C, C++, or Fortran
for efficiency, and to write additional primitives.
– Can exchange data with other sources (XLS, CSV, ODBC databases via RODBC, other statistical packages via foreign, MySQL)
• R is used for data manipulation, statistics, and graphics. It is
made up of:
– operators (+ - <- * %*% …) for calculations on arrays & matrices
– a large, coherent, integrated collection of functions and graphics
– user-written functions & sets of functions (packages); 800+
See the video “Intro to R” for a typical session
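A few lines of base R give a feel for the operators listed above and for a simple CSV round trip. This is only a sketch; the variable names and values are invented for illustration.

```r
# Assignment with <- and element-wise arithmetic on a vector
x <- c(1, 2, 3)
y <- x * 2 + 1               # element-wise: 3 5 7

# Matrices are filled column by column
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- diag(2)                 # 2 x 2 identity matrix

A * A                        # element-wise product
A %*% B                      # matrix product; multiplying by the identity returns A

# Exchange data through a CSV file (written to a temporary file here)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = y), tmp, row.names = FALSE)
dat <- read.csv(tmp)
dat
```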

Learning R....

Some examples…
1. One-way ANOVA to test the difference between
two (or more) group means.
• The output of our ANOVA test indicates that the difference between our
group means is statistically significant (p < .001).
• Conceptually, this suggests that employee attitudes
towards the experimental training program were
significantly higher than their attitudes towards the
preexisting program.
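A one-way ANOVA of this kind can be sketched in base R with `aov()`. The data below are simulated purely for illustration (they are not the data behind the result quoted above), and the column names `program` and `score` are assumptions.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical attitude scores for two training programs
attitudes <- data.frame(
  program = factor(rep(c("preexisting", "experimental"), each = 30)),
  score   = c(rnorm(30, mean = 5, sd = 1), rnorm(30, mean = 7, sd = 1))
)

# One-way ANOVA: does the mean score differ between programs?
fit <- aov(score ~ program, data = attitudes)
summary(fit)   # ANOVA table with the F statistic and p-value

# Group means, to see the direction of the difference
tapply(attitudes$score, attitudes$program, mean)
```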

Some examples…
2. Two-way ANOVA
• The output of our ANOVA test indicates that the difference between our
treatment group means is statistically significant (p < .001) and that the
difference between genders is not significant (p = .585).
• Statistically significant interaction (treatment group & gender) p = .032
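A two-way ANOVA with an interaction term might look like the following in base R. The design, effect sizes, and column names are all invented for this sketch, so its output will not match the p-values quoted above.

```r
set.seed(1)

# Hypothetical balanced design: 2 treatment groups x 2 genders, 25 per cell
d <- expand.grid(
  rep       = 1:25,
  treatment = c("control", "treatment"),
  gender    = c("female", "male")
)

# Simulated response: a treatment effect, no gender main effect
d$score <- 5 + 2 * (d$treatment == "treatment") + rnorm(nrow(d))

# treatment * gender expands to both main effects plus their interaction
fit <- aov(score ~ treatment * gender, data = d)
summary(fit)
```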

Some examples…
3. ANOVA with categorical variables
• “How well do quarterback salary and conference predict total team
salary?”
• The output of our test indicates that QB salary is a statistically significant predictor (p < .001),
but the categorical conference variable is not (p = .91).
• Considering both the counterintuitive and statistically non-significant results of this model, our
analysis of the conference variable would likely end or change direction at this point.
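A model with one numeric and one categorical predictor can be sketched with `lm()`, which dummy-codes factors automatically. The team data, the conference labels "AFC" and "NFC", and the column names are assumptions made up for this example.

```r
set.seed(7)

# Hypothetical data: 40 teams, salaries in millions of dollars
teams <- data.frame(
  conference = factor(rep(c("AFC", "NFC"), each = 20)),
  qb_salary  = runif(40, min = 1, max = 20)
)

# Simulated so that total salary depends on QB salary but not on conference
teams$total_salary <- 80 + 2 * teams$qb_salary + rnorm(40, sd = 5)

# The factor `conference` enters the model as a dummy variable
fit <- lm(total_salary ~ qb_salary + conference, data = teams)
summary(fit)   # coefficient table with a p-value per term
anova(fit)     # sequential ANOVA table for the same model
```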

R vs SAS/SPSS
For the full comparison chart, see http://rforsasandspssusers.com/ by Bob Muenchen

Over 800 add-on packages
(http://cran.r-project.org/src/contrib/PACKAGES.html)
• This is an enormous advantage - new
techniques available without delay, and they
can be performed using the R language you
already know.
• Downside: as the number of packages grows,
it is becoming difficult to choose the best
package for your needs, and quality control (QC) is an issue.

What is Data Mining? Analytics?
• Video by Davenport (Author of Book on
Analytics)
• What’s behind the increasing popularity of
data mining, and what is its relationship to
predictive analytics?
• Data mining is one of the components of the BI spectrum and is
included under the umbrella of Advanced Analytics.
• Predictive analytics is “applied data mining”: data mining is a set of
technologies and algorithms, and predictive analytics is the application
of these technologies.

Definition of Data Mining
Data mining is the exploration and analysis of large quantities of data in order
to discover valid, novel, potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the
patterns.
Data mining is the art and science of intelligent data analysis.

Case Study: Bank
1. Select subset of customer records who have received
home equity loan offer
– Customers who declined
– Customers who signed up
Income    Number of Children    Average Checking Account Balance    …    Response
$40,000   2                     $1500                               …    Yes
$75,000   0                     $5000                               …    No
$50,000   1                     $3000                               …    No
…         …                     …                                   …    …
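In R, step 1 amounts to loading the records into a data frame and splitting them by response. The sketch below uses only the three rows shown in the table; the column names are hypothetical.

```r
# The three example rows from the table; column names are made up
customers <- data.frame(
  income       = c(40000, 75000, 50000),
  num_children = c(2, 0, 1),
  avg_checking = c(1500, 5000, 3000),
  response     = c("Yes", "No", "No")
)

# Split the records by how the customer responded to the offer
signed_up <- subset(customers, response == "Yes")
declined  <- subset(customers, response == "No")

table(customers$response)   # counts per response
```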

Case Study: Bank (Contd.)
2. Find rules to predict whether a customer would
respond to home equity loan offer
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
…
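The single rule shown above can be written as a small vectorized R function. The function name, the argument names, and the two test customers are hypothetical. Because the remaining rules are elided on the slide, the sketch returns NA rather than guessing when this rule does not fire.

```r
# Hypothetical helper encoding the one rule shown above.
# Salary is in dollars, ages are in years; inputs may be vectors.
predict_response <- function(salary, num_children, age_child1) {
  rule_fires <- salary < 40000 &
    num_children > 0 &
    age_child1 > 18 & age_child1 < 22
  # NA means "this rule does not apply"; other rules are not shown
  ifelse(rule_fires, "YES", NA_character_)
}

# Two made-up customers: the first satisfies the rule, the second does not
predict_response(
  salary       = c(35000, 60000),
  num_children = c(1, 2),
  age_child1   = c(20, 20)
)
```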

3. Group customers into clusters and
investigate clusters
[Figure: customers grouped into four clusters, labeled Group 1 to Group 4]
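One common way to form such groups is k-means clustering, available in base R as `kmeans()`. The slides do not say which method was used, so this is only a sketch: the customer attributes, the simulated values, and the choice of four clusters are assumptions.

```r
set.seed(123)

# Hypothetical customers: four groups in (income, balance) space, in $1000s
centers <- rbind(c(30, 2), c(60, 5), c(90, 20), c(120, 50))
customers <- do.call(rbind, lapply(1:4, function(i) {
  cbind(income  = rnorm(50, centers[i, 1], 3),
        balance = rnorm(50, centers[i, 2], 1))
}))

# Scale first so income does not dominate the distance; nstart repeats
# the random initialization and keeps the best run
km <- kmeans(scale(customers), centers = 4, nstart = 20)

table(km$cluster)   # cluster sizes

# Per-cluster averages, as a starting point for investigating each group
aggregate(as.data.frame(customers),
          by = list(cluster = km$cluster), FUN = mean)
```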

4. Evaluate results:
– Many “uninteresting” clusters
– One interesting cluster! Customers with both
business and personal accounts; unusually high
percentage of likely respondents
Action:
• New marketing campaign
Result:
• Acceptance rate for home equity offers more than
doubled

Example Application: Fraud
Detection
• Industries: Health care, retail, credit
card services, telecom, B2B
relationships
• Approach:
– Use historical data to build models of
fraudulent behavior
– Deploy models to identify fraudulent
instances

Fraud Detection (Contd.)
• Examples:
– Auto insurance: Detect groups of people who stage accidents to
collect insurance
– Medical insurance: Fraudulent claims
– Money laundering: Detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– Telecom industry: Find calling patterns that deviate from a norm
(origin and destination of the call, duration, time of day, day of
week).
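The idea of finding calls that deviate from a norm can be illustrated with a very simple z-score check on call duration. Real fraud models are far richer. The data, the single feature, and the threshold of 3 standard deviations are all assumptions made for this sketch.

```r
set.seed(99)

# Hypothetical call durations in minutes for one subscriber
history   <- rnorm(200, mean = 5, sd = 1.5)   # past calls, presumed normal
new_calls <- c(4.8, 6.1, 45, 5.3)             # the 45-minute call is unusual

# Standardize the new calls against the historical norm
z <- (new_calls - mean(history)) / sd(history)

# Flag calls more than 3 standard deviations from the norm
# (a common convention, not a value taken from the slides)
suspicious <- abs(z) > 3
new_calls[suspicious]
```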

Data Mining Methods

Classification
Example application: telemarketing

Classification (Contd.)
• Decision trees are one approach to
classification.
• Other approaches include:
– Linear Discriminant Analysis
– k-nearest neighbor methods
– Logistic regression
– Neural networks
– Support Vector Machines
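Of the approaches listed, logistic regression is the easiest to try in base R, using `glm()` with a binomial family. The telemarketing data below are simulated, and the variable names and the 0.5 cutoff are assumptions for illustration.

```r
set.seed(2012)

# Hypothetical telemarketing data: does a customer buy after a call?
n   <- 300
age <- runif(n, 20, 70)
# Simulated truth: the purchase probability rises with age
buys  <- rbinom(n, size = 1, prob = plogis(-4 + 0.09 * age))
calls <- data.frame(age = age, buys = buys)

# Logistic regression: a binomial GLM with the default logit link
fit <- glm(buys ~ age, data = calls, family = binomial)

# Predicted purchase probabilities for two new customers
new_customers <- data.frame(age = c(25, 65))
p <- predict(fit, newdata = new_customers, type = "response")

# Turn probabilities into a class label with a 0.5 cutoff
ifelse(p > 0.5, "call", "skip")
```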

Decision Trees

What are Decision Trees?
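[Figure: a decision tree that splits on Age (<30 vs. >=30) and on Car Type (Minivan vs. Sports, Truck), with YES/NO leaves, shown next to the matching partition of the Age axis (0, 30, 60) by Car Type]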
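A tree over the same two attributes can be grown in R with the `rpart` package, one of the recommended packages that normally ship with R. The training data and the labeling rule below are invented for this sketch and are not taken from the figure.

```r
library(rpart)   # recommended package, included with standard R installs

set.seed(10)

# Hypothetical training data with the two attributes from the figure
n <- 300
drivers <- data.frame(
  age      = sample(18:60, n, replace = TRUE),
  car_type = factor(sample(c("Minivan", "Sports", "Truck"), n, replace = TRUE))
)

# Labeling rule assumed for this sketch only
drivers$label <- factor(
  ifelse(drivers$age < 30 | drivers$car_type == "Minivan", "YES", "NO")
)

# Grow a classification tree; rpart chooses the splits from the data
tree <- rpart(label ~ age + car_type, data = drivers, method = "class")
print(tree)      # text view of the splits and leaves

# Classify two new cases
new_cases <- data.frame(
  age      = c(25, 45),
  car_type = factor(c("Truck", "Sports"), levels = levels(drivers$car_type))
)
predict(tree, newdata = new_cases, type = "class")
```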

CLUSTERING

Market Basket Analysis:
Frequent Itemsets

Market Basket Analysis
• Given:
– A database of customer
transactions
– Each transaction is a set
of items
• Goal:
– Extract rules
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4

Market Basket Analysis (Contd.)
• Co-occurrences
– 80% of all customers purchase items X, Y
and Z together.
• Association rules
– 60% of all customers who purchase X and Y
also buy Z.
• Sequential patterns
– 60% of customers who first buy X also
purchase Y within three weeks.
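The percentages in co-occurrence statements and association rules are usually called support and confidence. As a rough sketch, both can be computed in base R from the four transactions (TID 111 to 114) in the transaction table. The helper names `support` and `confidence` are made up here, and the counts are per transaction rather than per customer.

```r
# Items per transaction (TID), taken from the transaction table
baskets <- list(
  "111" = c("Pen", "Ink", "Milk", "Juice"),
  "112" = c("Pen", "Ink", "Milk"),
  "113" = c("Pen", "Milk"),
  "114" = c("Pen", "Ink", "Juice")
)

# Support: fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("Pen", "Ink"))     # 3 of 4 transactions: 0.75
confidence("Ink", "Milk")    # 2 of the 3 Ink transactions: about 0.67
```

Dedicated packages exist for mining such rules at scale; the hand-rolled helpers here are only meant to make the definitions concrete.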

Some examples…
1. Example with Rattle and R

In summary!!!!

Learning R and Rattle (data mining)
• Read through the CRAN website
• Use http://www.rseek.org/ instead of Google
• Because R is interactive, errors are your friends!
• “Using R is a bit akin to smoking. The beginning is difficult, one
may get headaches and even gag the first few times. But in the
long run, it becomes pleasurable and even addictive.”
• It’s a journey of discovery!

All the best and Thank You!
