Raj Kasarabada
- Sept 28th 2012
(Some material borrowed from: UCLA Academic Technology Services
Technical Report Series and presentations found online & Harry Potter websites)
Introduction to R and Data Mining
Monday, September 01,
2014
1
What is R?
It is a programming language and a “lot like
magic...” except instead of spells you have
functions.
It is an open source software package.
R users are like wizards. They can rely on functions (spells) that have been
developed for them by statistical researchers, but they can also create their
own.
They don’t have to pay to use them, and once experienced enough,
they are almost unlimited in their ability to change their environment.
History of R
• S: language for data analysis developed at Bell Labs
circa 1976
• Licensed by AT&T/Lucent to Insightful Corp. Product
name: S-plus.
• R: initially written and released as open source
software by Ross Ihaka and Robert Gentleman at the University of
Auckland during the 1990s
• Since 1997: an international R core team of ~15 people, plus
1000s of code writers and statisticians happy to share
their libraries! AWESOME!
So what is it?
• R is an interpreted computer language.
– Can interface with procedures written in C, C++, or FORTRAN
for efficiency, and to write additional primitives.
– Can exchange data (XLS, CSV, MySQL; via packages such as RODBC and foreign)
•R is used for data manipulation, statistics, and graphics. It is
made up of:
–operators (+ - <- * %*% …) for calculations on arrays & matrices
–large, coherent, integrated collection of functions, graphics
–user written functions & sets of functions (packages); 800+
See video “Intro to R”: a typical session
Learning R....
Some examples…
1. One-way ANOVA to test the difference between
two (or more) group means.
• The output of our ANOVA test indicates that the difference between our
group means is statistically significant (p < .001).
• Conceptually, this suggests that employee attitudes
towards the experimental training program were
significantly higher than their attitudes towards the
preexisting program.
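To make the one-way ANOVA concrete, the sketch below computes the F statistic by hand. It is written in Python rather than R purely for illustration, and the attitude scores are made-up numbers, not the data behind the slide. In R itself, the built-in `aov()` function fits this kind of model.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical attitude scores (illustrative only, not from the slides).
preexisting = [3, 4, 5, 4, 4]
experimental = [6, 7, 8, 7, 7]
f_stat = one_way_anova_f([preexisting, experimental])
print(f_stat)  # 45.0
```

A large F means the group means differ by much more than the scatter inside each group would explain; turning F into a p-value additionally requires the F distribution, which this sketch leaves out.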
Some examples…
2. Two-way ANOVA
• The output of our ANOVA test indicates that the difference between our
treatment group means is statistically significant (p < .001) and that the
difference between genders is not significant (p = .585).
• The interaction between treatment group and gender is statistically significant (p = .032).
Some examples…
3. ANOVA with categorical variables
• “How well do quarterback salary and conference predict total team
salary?”
• The output of our test indicates a statistically significant effect (p < .001) for
quarterback salary but not (p = .91) for the categorical conference variable.
• Considering both the counterintuitive and statistically insignificant results of this model, our
analysis of the conference variable would likely end or change direction at this point.
R vs SAS/SPSS
For the full comparison chart, see http://rforsasandspssusers.com/ by Bob Muenchen
Over 800 add-on packages
(http://cran.r-project.org/src/contrib/PACKAGES.html)
• This is an enormous advantage - new
techniques available without delay, and they
can be performed using the R language you
already know.
• Downside: as the number of packages grows,
it is becoming difficult to choose the best
package for your needs, and quality control is an issue.
What is Data Mining? Analytics?
• Video by Davenport (Author of Book on
Analytics)
• What’s behind the increasing popularity of
data mining, and what is its relationship to
predictive analytics?
• Data mining is one component of the BI spectrum and falls
under the umbrella of Advanced Analytics.
• Predictive analytics is “applied data mining”: data mining is a set
of technologies and algorithms, and predictive analytics is the
application of these technologies.
Definition of Data Mining
Data mining is the exploration and analysis of large quantities of data in order
to discover valid, novel, potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the
patterns.
Data mining is the art and science of intelligent data analysis.
Case Study: Bank
1. Select the subset of customers who have received a
home equity loan offer
– Customers who declined
– Customers who signed up
Income     Number of Children   Average Checking Account Balance   …   Response
$40,000    2                    $1500                              …   Yes
$75,000    0                    $5000                              …   No
$50,000    1                    $3000                              …   No
…          …                    …                                  …   …
Case Study: Bank (Contd.)
2. Find rules to predict whether a customer would
respond to home equity loan offer
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
…
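As a rough illustration, the rule above can be written as a small predicate. This is a Python sketch, not the presenter's code; the function name, the field names, and the three sample customers are hypothetical.

```python
def likely_to_respond(salary, num_children, age_child1=None):
    """Apply the IF/THEN rule from the slide to one customer record."""
    return (
        salary < 40_000
        and num_children > 0
        and age_child1 is not None      # customers without children have no age to test
        and 18 < age_child1 < 22
    )

# Hypothetical customer records (illustrative only).
customers = [
    {"salary": 35_000, "num_children": 2, "age_child1": 19},
    {"salary": 75_000, "num_children": 0, "age_child1": None},
    {"salary": 38_000, "num_children": 1, "age_child1": 10},
]
predictions = [likely_to_respond(**c) for c in customers]
print(predictions)  # [True, False, False]
```

In practice such rules are learned from the historical response data rather than written by hand; the sketch only shows how one learned rule is applied.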
Case Study: Bank (Contd.)
3. Group customers into clusters and
investigate clusters
[Figure: customers grouped into four clusters, labeled Group 1 to Group 4]
Case Study: Bank (Contd.)
4. Evaluate results:
– Many “uninteresting” clusters
– One interesting cluster! Customers with both
business and personal accounts; unusually high
percentage of likely respondents
Action:
• New marketing campaign
Result:
• Acceptance rate for home equity offers more than
doubled
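One simple way to spot an "interesting" cluster is to compare response rates across clusters. The Python sketch below assumes each record is a (cluster label, responded) pair; the labels and the eight records are invented for illustration and are not the bank's data.

```python
from collections import defaultdict

def response_rate_by_cluster(records):
    """Return {cluster_label: fraction of customers who accepted the offer}."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for cluster, responded in records:
        totals[cluster] += 1
        if responded:
            accepted[cluster] += 1
    return {c: accepted[c] / totals[c] for c in totals}

# Hypothetical records (illustrative only).
records = [
    ("personal only", False), ("personal only", False),
    ("personal only", True), ("personal only", False),
    ("business and personal", True), ("business and personal", True),
    ("business and personal", False), ("business and personal", True),
]
rates = response_rate_by_cluster(records)
print(rates)  # {'personal only': 0.25, 'business and personal': 0.75}
```

A cluster whose rate sits well above the others is a candidate for a targeted campaign, which is the kind of action the case study describes.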
Example Application: Fraud
Detection
• Industries: Health care, retail, credit
card services, telecom, B2B
relationships
• Approach:
– Use historical data to build models of
fraudulent behavior
– Deploy models to identify fraudulent
instances
Fraud Detection (Contd.)
• Examples:
– Auto insurance: Detect groups of people who stage accidents to
collect insurance
– Medical insurance: Fraudulent claims
– Money laundering: Detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– Telecom industry: Find calling patterns that deviate from a norm
(origin and destination of the call, duration, time of day, day of
week).
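The telecom example, finding calls that deviate from a norm, can be sketched with a basic z-score check. This is only one of many possible approaches and is an assumption of this example, not a method named in the slides; the call durations and the threshold of 3 standard deviations are likewise illustrative.

```python
from statistics import mean, stdev

def flag_unusual(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: anything different from the mean is unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Hypothetical call durations in minutes for one subscriber.
durations = [4, 5, 6, 5, 4, 6, 5, 5]
print(flag_unusual(durations, 5.5))  # False
print(flag_unusual(durations, 45))   # True
```

A real system would combine several features (origin, destination, time of day, day of week) rather than a single number.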
Data Mining Methods
Classification
Example application: telemarketing
Classification (Contd.)
• Decision trees are one approach to
classification.
• Other approaches include:
– Linear Discriminant Analysis
– k-nearest neighbor methods
– Logistic regression
– Neural networks
– Support Vector Machines
Decision Trees
What are Decision Trees?
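[Figure: decision tree]
Age
  <30: YES
  >=30: Car Type
    Minivan: YES
    Sports, Truck: NO
[Figure: the same rule as a partition of the data, with Age on the horizontal axis (0, 30, 60) and Car Type (Minivan; Sports, Truck) on the vertical axis. Regions: YES for Age <30, YES for Minivan at Age >=30, NO for Sports, Truck at Age >=30]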
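A decision tree is just a nested sequence of tests. The Python sketch below assumes the tree on this slide reads: split on Age first (<30 gives YES), then on Car Type for Age >=30 (Minivan gives YES; Sports or Truck gives NO). That reading is reconstructed from the slide labels, and the function name is hypothetical.

```python
def predict(age, car_type):
    """Walk the assumed two-level tree and return "YES" or "NO"."""
    if age < 30:
        return "YES"                 # left branch: Age < 30
    if car_type == "Minivan":
        return "YES"                 # Age >= 30 and Car Type = Minivan
    return "NO"                      # Age >= 30 and Car Type in {Sports, Truck}

print(predict(25, "Sports"))   # YES
print(predict(45, "Minivan"))  # YES
print(predict(45, "Truck"))    # NO
```

Each path from the root to a leaf corresponds to one IF/THEN rule, which is why trees are often valued for being easy to interpret.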
CLUSTERING
Market Basket Analysis:
Frequent Itemsets
Market Basket Analysis
• Given:
– A database of customer
transactions
– Each transaction is a set
of items
• Goal:
– Extract rules
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
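The table above can be turned into item sets per transaction, from which frequent itemsets follow by counting. The Python sketch below uses a brute-force enumeration of all item combinations, which is fine for four transactions but is not the Apriori algorithm or any other optimized miner; the function names and the 0.75 support threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# (TID, item) pairs taken from the transaction table above.
rows = [
    (111, "Pen"), (111, "Ink"), (111, "Milk"), (111, "Juice"),
    (112, "Pen"), (112, "Ink"), (112, "Milk"),
    (113, "Pen"), (113, "Milk"),
    (114, "Pen"), (114, "Ink"), (114, "Juice"),
]

def build_transactions(rows):
    """Group items by transaction ID into a list of item sets."""
    transactions = {}
    for tid, item in rows:
        transactions.setdefault(tid, set()).add(item)
    return list(transactions.values())

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute force: test every combination of items against min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result

transactions = build_transactions(rows)
frequent = frequent_itemsets(transactions, min_support=0.75)
for itemset, s in frequent.items():
    print(itemset, s)
```

With this threshold the frequent itemsets are {Ink}, {Milk}, {Pen}, {Ink, Pen} and {Milk, Pen}; Pen appears in all four transactions.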
Market Basket Analysis (Contd.)
• Co-occurrences
– 80% of all customers purchase items X, Y
and Z together.
• Association rules
– 60% of all customers who purchase X and Y
also buy Z.
• Sequential patterns
– 60% of customers who first buy X also
purchase Y within three weeks.
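The percentages in an association rule are usually called confidence: among the baskets that contain the left-hand side, the share that also contain the right-hand side. The Python sketch below measures this per transaction (not per customer, as the slide phrases it) on the four baskets from the Pen/Ink/Milk/Juice table shown earlier; the function names are hypothetical.

```python
# The four baskets from the transaction table (TIDs 111 to 114).
transactions = [
    {"Pen", "Ink", "Milk", "Juice"},
    {"Pen", "Ink", "Milk"},
    {"Pen", "Milk"},
    {"Pen", "Ink", "Juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

print(confidence({"Pen"}, {"Ink"}, transactions))          # 0.75
print(confidence({"Pen", "Ink"}, {"Milk"}, transactions))  # 2 of the 3 Pen+Ink baskets contain Milk
```

The co-occurrence figure on the slide corresponds to `support`, while the association-rule figure corresponds to `confidence`; sequential patterns additionally need the purchase dates, which this sketch ignores.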
Some examples…
1. Example with Rattle and R
In summary!!!!
Learning R and Rattle (data mining)
• Read through the CRAN website
• Use http://www.rseek.org/ instead of Google
• Because R is interactive, errors are your friends!
• “Using R is a bit akin to smoking. The beginning is difficult, one
may get headaches and even gag the first few times. But in the
long run, it becomes pleasurable and even addictive.”
• It’s a journey of discovery!
All the best and Thank You!