Intro to R and Data Mining 2012 09 27
1. Raj Kasarabada
- Sept 28th 2012
(Some material borrowed from: UCLA Academic Technology Services Technical Report Series, presentations found online, and Harry Potter websites)
Introduction to R and Data Mining
Monday, September 01,
2014
1
2. What is R?
R is a programming language that is a “lot like magic...”, except that instead of spells you have functions.
It is also an open source software package.
3. R user = wizard
R users are like wizards. They can rely on functions (spells) that statistical researchers have developed for them, but they can also create their own.
They don’t have to pay to use them and, once experienced enough, they are almost unlimited in their ability to change their environment.
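A minimal sketch of "creating your own spell", assuming a made-up helper (the name std_error and the sample data are hypothetical, chosen only for illustration). It combines functions that ship with R into a new one:

```r
# A user-written function: the standard error of the mean.
# The name std_error is a hypothetical choice for this sketch.
std_error <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  sd(x) / sqrt(length(x))    # built-in sd(), sqrt(), and length()
}

scores <- c(4, 8, 15, 16, 23, 42)   # made-up data
std_error(scores)
```

Once defined, the function is called exactly like a built-in one.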
4. History of R
• S: a language for data analysis developed at Bell Labs circa 1976
• Licensed by AT&T/Lucent to Insightful Corp. Product name: S-PLUS.
• R: initially written & released as open source software by Ross Ihaka and Robert Gentleman at the University of Auckland during the 1990s
• Since 1997: an international R Core Team of ~15 people & 1000s of code writers and statisticians happy to share their libraries! AWESOME!
5. So what is it?
• R is an interpreted computer language.
– Can interface with procedures written in C, C++, or FORTRAN for efficiency, and to write additional primitives.
– Can exchange data with other formats and systems (XLS, CSV, MySQL; the RODBC and foreign packages)
• R is used for data manipulation, statistics, and graphics. It is made up of:
– operators (+ - <- * %*% …) for calculations on arrays & matrices
– a large, coherent, integrated collection of functions and graphics
– user-written functions & sets of functions (packages); 800+
See the video “Intro to R” for a typical session.
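A short sketch of the operators and the CSV exchange listed above, using base R only. The variable names and values are made up for illustration:

```r
# Assignment with <- and element-wise arithmetic on vectors
x <- c(1, 2, 3)
y <- x * 2 + 1

# Matrices: * is element-wise, %*% is the matrix product
A <- matrix(1:4, nrow = 2)    # filled by column: rows are (1, 3) and (2, 4)
I <- diag(2)                  # 2 x 2 identity matrix
elementwise <- A * I          # keeps only the diagonal of A
product     <- A %*% I        # multiplying by the identity returns A

# Data exchange: write a data frame to CSV and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = x, y = y), csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path)
```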
7. Some examples…
1. One-way ANOVA to test the difference between two (or more) group means.
• The output of our ANOVA test indicates that the difference between our group means is statistically significant (p < .001).
• Conceptually, this suggests that employee attitudes towards the experimental training program were significantly more favorable than their attitudes towards the preexisting program.
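A minimal sketch of how a one-way ANOVA like this might be run in R. The data are simulated, and the column names program and score are assumptions; this is not the data set behind the slide:

```r
set.seed(1)  # reproducible simulated data, not the data behind the slide
attitudes <- data.frame(
  program = factor(rep(c("preexisting", "experimental"), each = 30)),
  score   = c(rnorm(30, mean = 5, sd = 1), rnorm(30, mean = 7, sd = 1))
)

fit <- aov(score ~ program, data = attitudes)  # one-way ANOVA
summary(fit)                                   # F statistic and p-value

# Extract the p-value for the program effect from the summary table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

A small p_value indicates that the group means differ by more than chance alone would explain.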
8. Some examples…
2. Two-way ANOVA
• The output of our ANOVA test indicates that the difference between our treatment group means is statistically significant (p < .001) and that the difference between genders is not significant (p = .585).
• The interaction between treatment group and gender is statistically significant (p = .032).
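A sketch of a two-way ANOVA with an interaction term, again on simulated data (the factor levels and the effect size are assumptions made for illustration):

```r
set.seed(2)  # simulated data for illustration only
design <- expand.grid(
  rep       = 1:20,
  treatment = c("control", "training"),
  gender    = c("female", "male")
)  # expand.grid() turns the character columns into factors by default

design$score <- 5 + 1.5 * (design$treatment == "training") +
  rnorm(nrow(design))

# treatment * gender expands to both main effects plus their interaction
fit <- aov(score ~ treatment * gender, data = design)
summary(fit)
```

The summary table has one row per main effect, one row for the treatment:gender interaction, and one for the residuals.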
9. Some examples…
3. ANOVA with categorical variables
• “How well do quarterback salary and conference predict total team salary?”
• The output of our test indicates that QB salary is a statistically significant predictor (p < .001) but that the categorical conference variable is not (p = .91).
• Considering both the counterintuitive and statistically insignificant results of this model, our analysis of the conference variable would likely end or change directions at this point.
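One way such a model might be fitted in R, sketched on simulated salaries. The column names, the conference labels AFC and NFC, and all numbers are assumptions, not the data from the slide:

```r
set.seed(3)  # simulated salaries (in $ millions), not real team data
teams <- data.frame(
  qb_salary  = runif(32, min = 1, max = 20),
  conference = factor(rep(c("AFC", "NFC"), each = 16))
)
teams$team_salary <- 80 + 2 * teams$qb_salary + rnorm(32, sd = 5)

# lm() dummy-codes the factor automatically; anova() tests each term
fit <- lm(team_salary ~ qb_salary + conference, data = teams)
anova(fit)
```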
10. R vs SAS/SPSS
For the full comparison chart, see http://rforsasandspssusers.com/ by Bob Muenchen
11. Over 800 add-on packages
(http://cran.r-project.org/src/contrib/PACKAGES.html)
• This is an enormous advantage: new techniques are available without delay, and they can be performed using the R language you already know.
• Downside: as the number of packages grows, it is becoming difficult to choose the best package for your needs, and quality control (QC) is an issue.
12. What is Data Mining? Analytics?
• Video by Davenport (Author of Book on
Analytics)
• What’s behind the increasing popularity of
data mining, and what is its relationship to
predictive analytics?
• Data mining is one of the components of the BI (business intelligence) spectrum and is included under the umbrella of Advanced Analytics.
• In relation to data mining, predictive analytics is “applied data mining”: data mining is a set of technologies and algorithms, and predictive analytics is the application of these technologies.
13. Definition of Data Mining
Data mining is the exploration and analysis of large quantities of data in order
to discover valid, novel, potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the
patterns.
Data mining is the art and science of intelligent data analysis.
14. Case Study: Bank
1. Select the subset of records for customers who have received a home equity loan offer
– Customers who declined
– Customers who signed up
Income    Number of Children   Average Checking Account Balance   …   Response
$40,000   2                    $1500                              …   Yes
$75,000   0                    $5000                              …   No
$50,000   1                    $3000                              …   No
…         …                    …                                  …   …
15. Case Study: Bank (Contd.)
2. Find rules to predict whether a customer would respond to a home equity loan offer
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
…
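The rule above can be sketched as an R function. The customer records are made up, and answering "NO" whenever the rule does not fire is an assumption of this sketch (the slide elides the remaining rules):

```r
# Hypothetical customer records; column names mirror the rule on the slide
customers <- data.frame(
  salary      = c(35000, 35000, 60000, 38000),
  numChildren = c(1, 0, 2, 2),
  ageChild1   = c(19, NA, 20, 10)
)

# IF (Salary < 40k) and (numChildren > 0) and (18 < ageChild1 < 22) THEN YES
predict_response <- function(d) {
  hit <- d$salary < 40000 & d$numChildren > 0 &
    !is.na(d$ageChild1) & d$ageChild1 > 18 & d$ageChild1 < 22
  ifelse(hit, "YES", "NO")   # "NO" as the default is an assumption
}

customers$predicted <- predict_response(customers)
```

Only the first record satisfies all three conditions, so only it is predicted to respond.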
16. Case Study: Bank (Contd.)
3. Group customers into clusters and
investigate clusters
[Figure: customers grouped into four clusters, labeled Group 1, Group 2, Group 3, and Group 4]
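A minimal clustering sketch using k-means from base R. The slide does not say which clustering method was used, so k-means, the two simulated customer segments, and the column names are all assumptions:

```r
set.seed(4)  # simulated customers; k-means results depend on random starts
customers <- data.frame(
  income  = c(rnorm(25, 40, 5), rnorm(25, 80, 5)),    # in $1000s
  balance = c(rnorm(25, 2, 0.5), rnorm(25, 6, 0.5))   # in $1000s
)

# Scale first so both variables contribute equally to the distances
clusters <- kmeans(scale(customers), centers = 2, nstart = 10)

table(clusters$cluster)   # cluster sizes
# Investigate the clusters: average profile of each group
profile <- aggregate(customers, by = list(cluster = clusters$cluster),
                     FUN = mean)
```

Comparing the rows of profile is one simple way to "investigate clusters" after grouping.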
17. Case Study: Bank (Contd.)
4. Evaluate results:
– Many “uninteresting” clusters
– One interesting cluster! Customers with both
business and personal accounts; unusually high
percentage of likely respondents
Action:
• New marketing campaign
Result:
• Acceptance rate for home equity offers more than
doubled
18. Example Application: Fraud Detection
• Industries: Health care, retail, credit
card services, telecom, B2B
relationships
• Approach:
– Use historical data to build models of
fraudulent behavior
– Deploy models to identify fraudulent
instances
19. Fraud Detection (Contd.)
• Examples:
– Auto insurance: Detect groups of people who stage accidents to
collect insurance
– Medical insurance: Fraudulent claims
– Money laundering: Detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– Telecom industry: Find calling patterns that deviate from a norm
(origin and destination of the call, duration, time of day, day of
week).
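As a rough illustration of "patterns that deviate from a norm", the sketch below flags unusually long calls with a robust z-score. The call durations and the threshold of 3 are assumptions; real fraud models use many more features:

```r
# Made-up call durations in minutes for one subscriber
durations <- c(3, 5, 4, 6, 5, 4, 3, 5, 4, 95)

# Robust z-score: distance from the median in units of the MAD,
# so a single extreme call does not distort the "norm" itself
robust_z <- (durations - median(durations)) / mad(durations)

suspicious <- which(abs(robust_z) > 3)   # the threshold 3 is an assumption
durations[suspicious]
```

Here only the 95-minute call is flagged for review.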
22. Classification (Contd.)
• Decision trees are one approach to
classification.
• Other approaches include:
– Linear Discriminant Analysis
– k-nearest neighbor methods
– Logistic regression
– Neural networks
– Support Vector Machines
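As one example from the list above, logistic regression is available in base R through glm(). This is a sketch on simulated data; the variable names and coefficients are assumptions:

```r
set.seed(5)  # simulated data for illustration only
n <- 200
income <- runif(n, 20, 100)                          # in $1000s
responded <- rbinom(n, 1, plogis(-4 + 0.06 * income))

# family = binomial makes glm() fit a logistic regression
fit <- glm(responded ~ income, family = binomial)

# Predicted probability of the positive class for new cases
new_cases <- data.frame(income = c(30, 90))
probs <- predict(fit, newdata = new_cases, type = "response")
predicted_class <- ifelse(probs > 0.5, "YES", "NO")
```

The 0.5 cutoff is a common default, not a requirement; it can be moved depending on the cost of each kind of error.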
24. What are Decision Trees?
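[Figure: decision tree. The root splits on Age: <30 leads to YES; >=30 leads to a split on Car Type, where Minivan leads to YES and Sports, Truck leads to NO. A companion plot shows the same YES/NO regions over Age (0, 30, 60) and Car Type (Minivan; Sports, Truck).]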
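A decision tree is just a nested set of tests. The sketch below hand-codes the tree from this slide, assuming the figure reads as: Age < 30 gives YES; otherwise Minivan gives YES and Sports or Truck gives NO. The function name classify is hypothetical:

```r
# Tree as read from the slide's figure (an interpretation of the layout):
#   Age < 30                         -> YES
#   Age >= 30 and Car Type Minivan   -> YES
#   Age >= 30 and Sports or Truck    -> NO
classify <- function(age, car_type) {
  ifelse(age < 30, "YES",
         ifelse(car_type == "Minivan", "YES", "NO"))
}

classify(age = c(25, 45, 45), car_type = c("Sports", "Minivan", "Truck"))
```

In practice the tree is learned from data rather than written by hand; this only shows how a finished tree is applied to new records.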
27. Market Basket Analysis
• Given:
– A database of customer transactions
– Each transaction is a set of items
• Goal:
– Extract rules
TID   CID   Date     Item    Qty
111   201   5/1/99   Pen     2
111   201   5/1/99   Ink     1
111   201   5/1/99   Milk    3
111   201   5/1/99   Juice   6
112   105   6/3/99   Pen     1
112   105   6/3/99   Ink     1
112   105   6/3/99   Milk    1
113   106   6/5/99   Pen     1
113   106   6/5/99   Milk    1
114   201   7/1/99   Pen     2
114   201   7/1/99   Ink     2
114   201   7/1/99   Juice   4
28. Market Basket Analysis (Contd.)
• Co-occurrences
– 80% of all customers purchase items X, Y
and Z together.
• Association rules
– 60% of all customers who purchase X and Y
also buy Z.
• Sequential patterns
– 60% of customers who first buy X also
purchase Y within three weeks.
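Co-occurrence and association-rule percentages like these come from two simple measures, support and confidence. The sketch below computes them for the four transactions (TIDs 111 to 114) in the Market Basket Analysis table. It counts transactions rather than customers, which is an assumption of this sketch, and the function names are hypothetical:

```r
# The four transactions from the Market Basket Analysis table (by TID)
baskets <- list(
  "111" = c("Pen", "Ink", "Milk", "Juice"),
  "112" = c("Pen", "Ink", "Milk"),
  "113" = c("Pen", "Milk"),
  "114" = c("Pen", "Ink", "Juice")
)

# Support: fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

support(c("Pen", "Ink"))     # 3 of 4 transactions: 0.75
confidence("Ink", "Milk")    # 2 of the 3 transactions containing Ink
```

Dedicated packages automate the search over all item sets; this only shows what the reported numbers mean.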
31. Learning R and Rattle (data mining)
• Read through the CRAN website
• Use http://www.rseek.org/ instead of Google
• Because R is interactive, errors are your friends!
• “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive.”
• It’s a journey of discovery!
32. All the best and Thank You!