Introduction with Examples
Eoin Brazil - https://github.com/braz/DublinR-ML-treesandforests
Machine Learning Techniques in R
A bit of context around ML
How can you interpret their results?
A few techniques to improve prediction / reduce over-fitting
Kaggle & similar competitions - using ML for fun & profit
Nuts & Bolts - 4 data sets and 6 techniques
A brief tour of some useful data handling / formatting tools
Types of Learning
A bit of context around ML
Model Building Process
Model Selection and Model Assessment
Model Choice - Move from Adaptability to Simplicity
Interpreting A Confusion Matrix
Interpreting A Confusion Matrix Example
Confusion Matrix - Calculations
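
The usual quantities derived from a 2x2 confusion matrix can be sketched in base R as below. The counts and the helper name `confusion_metrics` are hypothetical and chosen only for illustration; they are not taken from the talk's data sets.

```r
# Hypothetical counts for a two-class problem (not from the slides).
tp <- 40  # actual positive, predicted positive
fn <- 10  # actual positive, predicted negative
fp <- 5   # actual negative, predicted positive
tn <- 45  # actual negative, predicted negative

# Standard summary rates computed from the four cells.
confusion_metrics <- function(tp, fn, fp, tn) {
  c(
    accuracy    = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = tp / (tp + fn),  # true positive rate (recall)
    specificity = tn / (tn + fp),  # true negative rate
    precision   = tp / (tp + fp),  # positive predictive value
    fpr         = fp / (fp + tn)   # false positive rate = 1 - specificity
  )
}

m <- confusion_metrics(tp, fn, fp, tn)
print(round(m, 3))
```

The sensitivity and false positive rate computed here are the two coordinates of a single point on a ROC plot.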
Interpreting A ROC Plot
· A point in this plot is better than another if it is to the northwest (TPR higher, FPR lower, or both)
· "Conservatives" - on the LHS and near the X-axis - only make positive classifications with strong evidence, so they make few FP errors but have low TP rates
· "Liberals" - on the upper RHS - make positive classifications with weak evidence, so nearly all positives are identified but FP rates are high
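
A minimal sketch of how one classifier yields different ROC points as its decision threshold moves. The scores, labels, thresholds, and the helper name `roc_point` are all made up for illustration.

```r
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10)
labels <- c(1,    1,    0,    1,    0,    1,    0,    0)

# Returns the (FPR, TPR) point obtained by calling everything
# with score >= threshold a positive.
roc_point <- function(threshold, scores, labels) {
  predicted <- as.integer(scores >= threshold)
  tpr <- sum(predicted == 1 & labels == 1) / sum(labels == 1)
  fpr <- sum(predicted == 1 & labels == 0) / sum(labels == 0)
  c(fpr = fpr, tpr = tpr)
}

conservative <- roc_point(0.90, scores, labels)  # strong evidence required
liberal      <- roc_point(0.35, scores, labels)  # weak evidence accepted

print(rbind(conservative, liberal))
```

With these toy values the high threshold lands on the left near the X-axis (no false positives, few true positives), while the low threshold lands towards the upper right (all positives found, at the cost of more false positives). Neither point is northwest of the other, so the choice depends on the relative cost of the two error types.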
ROC Dangers
Addressing Prediction Error
· K-fold Cross-Validation (e.g. 10-fold)
  - Allows for averaging the error across the models
· Bootstrapping, draw B random samples with replacement from data set to create B bootstrapped data sets with same size as original. These are used as training sets with the original used as the test set.
· Other variations on above:
  - Repeated cross validation
  - The '.632' bootstrap
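
Both resampling schemes can be sketched in base R. This is only an illustration under assumptions: the built-in `mtcars` data and a simple `lm` model stand in for the talk's data sets and classifiers, and RMSE is used as the error measure.

```r
set.seed(42)
data <- mtcars          # stand-in data set (assumption, not from the talk)
n <- nrow(data)

# --- K-fold cross-validation (10-fold) ---
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold per row
cv_rmse <- sapply(seq_len(k), function(i) {
  train <- data[fold_id != i, ]
  test  <- data[fold_id == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})
mean_cv_rmse <- mean(cv_rmse)   # error averaged across the k models

# --- Bootstrap: B resamples of size n, drawn with replacement ---
B <- 200
boot_rmse <- replicate(B, {
  idx  <- sample.int(n, size = n, replace = TRUE)
  fit  <- lm(mpg ~ wt + hp, data = data[idx, ])
  pred <- predict(fit, newdata = data)  # original data used as the test set
  sqrt(mean((data$mpg - pred)^2))
})
mean_boot_rmse <- mean(boot_rmse)

print(c(cv = mean_cv_rmse, bootstrap = mean_boot_rmse))
```

Because each bootstrap sample overlaps heavily with the original data it is tested on, the plain bootstrap estimate tends to be optimistic; the '.632' variant is one common correction for that.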
Addressing Feature Selection
Kaggle - using ML for fun & profit
Nuts & Bolts - Data sets and Techniques
Aside - How does associative analysis work?
What are they good for?
Marketing Survey Data - Part 1
Marketing Survey Data - Part 2
Aside - How do decision trees work?
What are they good for?
Car Insurance Policy Exposure Management - Part 1
· Analysing insurance claim details of 67856 policies taken out in 2004 and 2005.
· The model maps each record into one of X mutually exclusive terminal nodes or groups.
· These groups are represented by their average response, where the node number is treated as the data group.
· The binary claim indicator is modelled using 6 variables, giving a probability estimate for each terminal node of whether an insurance policyholder will claim on their policy.
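
The idea of mapping every record to one terminal node and summarising each node by its average response can be sketched with `rpart`. The `kyphosis` data shipped with `rpart` is used here as a stand-in (an assumption); the insurance data from the talk is not reproduced.

```r
library(rpart)

# Classification tree on a stand-in data set with a binary response.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# fit$where gives, for each record, the row of fit$frame holding its
# terminal node; the row names of fit$frame are the node numbers.
node_ids <- as.integer(rownames(fit$frame))[fit$where]

# Average response per terminal node = estimated probability for that group.
present <- as.integer(kyphosis$Kyphosis == "present")
node_summary <- data.frame(
  node      = sort(unique(node_ids)),
  n         = as.vector(table(node_ids)),
  p_present = as.vector(tapply(present, node_ids, mean))
)
print(node_summary)
```

Each record falls in exactly one row of `node_summary`, so the `n` column sums to the number of records.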
Car Insurance Policy Exposure Management - Part 2
· Root node splits the data set on 'agecat'
· Younger drivers to the left (1-8) and older drivers (9-11) to the right
· N9 splits on the basis of vehicle value
· N10 <= $28.9k, giving 15k records and 5.4% of claims
· N11 > $28.9k, giving 1.9k records and 8.5% of claims
· Left split from root: N2 splits on vehicle body type, then on age (N4), then on vehicle value (N6)
· The n value = number of records from the overall population and the y value = probability of a claim from a driver in that group
What are they good for?
Cancer Research Screening - Part 1
· Hill et al (2007), models how well cells within an image are segmented, 61 vars with 2019 obs (Training = 1009 & Test = 1010).
  - "Impact of image segmentation on high-content screening data quality for SK-BR-3 cells, Andrew A Hill, Peter LaPan, Yizheng Li and Steve Haney, BMC Bioinformatics 2007, 8:340".
  - b, Well-Segmented (WS)
  - c, WS (e.g. complete nucleus and cytoplasmic region)
  - d, Poorly-Segmented (PS)
  - e, PS (e.g. partial match/es)
Cancer Research Screening Dataset - Part 2
Cancer Research Screening - Part 3
"prp(rpartTune$finalModel)" "fancyRpartPlot(rpartTune$finalModel)"
Cancer Research Screening - Part 4
Cancer Research Screening Dataset - Part 5
What are they good for?
Predicting the Quality of Wine - Part 1
· Cortez et al (2009), models the quality of wines (Vinho Verde), 14 vars with 6497 obs (Training = 5199 & Test = 1298).
  - Good (quality score is >= 6)
  - Bad (quality score is < 6)
· "Modeling wine preferences by data mining from physicochemical properties, P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Decision Support Systems 2009, 47(4):547-553".
##
##  Bad Good
##  476  822
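
A minimal sketch of how a numeric quality score is turned into the two classes above. The `quality` vector is hypothetical; only the cut-off of 6 comes from the slide.

```r
# Hypothetical quality scores (the real data has one score per wine).
quality <- c(3, 5, 6, 7, 5, 8, 6, 4, 9, 6)

# Scores of 6 or more are labelled "Good", everything else "Bad".
quality_class <- factor(ifelse(quality >= 6, "Good", "Bad"),
                        levels = c("Bad", "Good"))

counts <- table(quality_class)
print(counts)
```

Calling `table()` on the test-set labels in the same way gives a class count like the one printed on the slide.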
Predicting the Quality of Wine - Part 2
Predicting the Quality of Wine - Part 3
Predicting the Quality of Wine - Part 4 - Problems with Trees
Advantages:
· Deals with irrelevant inputs
· No data preprocessing required
· Scalable computation (fast to build)
· Tolerant of missing values (little loss of accuracy)
· Only a few tunable parameters (easy to learn)
· Allows for human understandable graphic representation
Problems:
· Data fragmentation for high-dimensional sparse data sets (over-fitting)
· Difficult to fit to a trend / piece-wise constant model
· Highly influenced by changes to the data set and local optima (deep trees might be questionable as the errors propagate down)
Aside - How does a random forest work?
Predicting the Quality of Wine - Part 5 - Random Forest
Predicting the Quality of Wine - Part 6 - Other ML methods
· K-nearest neighbors
  - Unsupervised learning / non-target based learning
  - Distance matrix / cluster analysis using Euclidean distances.
· Neural Nets
  - Looking at basic feed forward simple 3-layer network (input, 'processing', output)
  - Each node / neuron is a set of numerical parameters / weights tuned by the learning algorithm used
· Support Vector Machines
  - Supervised learning
  - non-probabilistic binary linear classifier / nonlinear classifiers by applying the kernel trick
  - constructs a hyper-plane/s in a high-dimensional space
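
A from-scratch sketch of k-nearest-neighbour classification using Euclidean distances, in base R. The helper name `knn_predict` is hypothetical, and the built-in `iris` data stands in for the wine data (an assumption); in practice a packaged implementation would normally be used.

```r
# Predict a class for each row of new_x by majority vote among the
# k training rows that are closest in Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    d       <- sqrt(rowSums(sweep(train_x, 2, row)^2))  # distance to every training row
    nearest <- order(d)[seq_len(k)]                     # indices of the k closest rows
    votes   <- table(train_y[nearest])
    names(votes)[which.max(votes)]                      # most frequent class wins
  })
}

set.seed(1)
idx     <- sample(nrow(iris), 100)
train_x <- scale(iris[idx, 1:4])
# Scale the test rows with the training means and standard deviations.
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred     <- knn_predict(train_x, iris$Species[idx], test_x, k = 5)
accuracy <- mean(pred == iris$Species[-idx])
print(accuracy)
```

Scaling matters here because Euclidean distance is dominated by whichever variable has the largest numeric range.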
Aside - How does k nearest neighbors work?
Predicting the Quality of Wine - Part 7 - kNN
Aside - How do neural networks work?
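
A minimal sketch of the forward pass through a simple 3-layer feed-forward network (input, hidden 'processing' layer, output). The layer sizes, the sigmoid activation, and the random weights are assumptions for illustration; a learning algorithm would tune the weights rather than draw them at random.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Each layer multiplies its inputs by a weight matrix, adds a bias,
# and passes the result through the activation function.
forward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)       # hidden ('processing') layer
  output <- sigmoid(W2 %*% hidden + b2)  # output layer
  as.numeric(output)
}

set.seed(7)
n_in <- 3; n_hidden <- 4; n_out <- 1
W1 <- matrix(rnorm(n_hidden * n_in), n_hidden, n_in)
b1 <- rnorm(n_hidden)
W2 <- matrix(rnorm(n_out * n_hidden), n_out, n_hidden)
b2 <- rnorm(n_out)

x <- c(0.5, -1.2, 2.0)   # one hypothetical input record
p <- forward(x, W1, b1, W2, b2)
print(p)
```

With a sigmoid on the output node, `p` always lies strictly between 0 and 1 and can be read as a class probability.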
Predicting the Quality of Wine - Part 8 - NNET
Aside - How do support vector machines work?
Predicting the Quality of Wine - Part 9 - SVM
Predicting the Quality of Wine - Part 10 - All Results
What are they not good for?
Predicting the Extramarital Affairs
· Fair, R.C. (1978), models the possibility of affairs, 9 vars with 601 obs (Training = 481 & Test = 120).
  - Yes (affairs is >= 1 in last 6 months)
  - No (affairs is < 1 in last 6 months)
· "A Theory of Extramarital Affairs, Fair, R.C., Journal of Political Economy 1978, 86:45-61".
##
##  No Yes
##  90  30
Extramarital Dataset
Predicting the Extramarital Affairs - RF & NB
Random Forest
##           Reference
## Prediction No Yes
##        No  90  30
##        Yes  0   0
## Accuracy
##     0.75
Naive Bayes
##           Reference
## Prediction No Yes
##        No  88  29
##        Yes  2   1
## Accuracy
##     0.75
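
Recomputing a few summary rates from the two confusion matrices shows why this data set is a poor fit for these models. The counts are copied from the slide; the helper name `summarise_cm` and the choice of metrics are assumptions made for this sketch.

```r
# Confusion matrices as printed on the slide (rows = prediction, columns = reference).
dn <- list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
rf <- matrix(c(90, 0, 30, 0), nrow = 2, dimnames = dn)  # Random Forest
nb <- matrix(c(88, 2, 29, 1), nrow = 2, dimnames = dn)  # Naive Bayes

summarise_cm <- function(cm) {
  c(
    accuracy        = sum(diag(cm)) / sum(cm),
    no_info_rate    = max(colSums(cm)) / sum(cm),          # always guess the majority class
    sensitivity_yes = cm["Yes", "Yes"] / sum(cm[, "Yes"])  # share of actual "Yes" found
  )
}

results <- rbind(random_forest = summarise_cm(rf), naive_bayes = summarise_cm(nb))
print(round(results, 3))
```

From the counts as printed, the random forest scores 90/120 = 0.75 only by predicting "No" for everyone, which is exactly the no-information rate, and Naive Bayes gets 89/120 (about 0.742) while finding just 1 of the 30 "Yes" cases.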
Other related tools: OpenRefine (formerly Google Refine) / Rattle
Other related tools: Command Line Utilities
· http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
· http://blog.comsysto.com/2013/04/25/data-analysis-with-the-unix-shell/
  - sed / awk
  - head / tail
  - wc (word count)
  - grep
  - sort / uniq
  - join
  - Gnuplot
· http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
  - http://csvkit.readthedocs.org/en/latest/
  - https://github.com/jehiah/json2csv
  - http://stedolan.github.io/jq/
  - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/sample
  - https://github.com/bitly/data_hacks
  - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/Rio
  - https://github.com/parmentf/xml2json
An (incomplete) tour of the packages in R
· caret
· party
· rpart
· rpart.plot
· AppliedPredictiveModeling
· randomForest
· corrplot
· arules
· arulesViz
· C50
· pROC
· kernlab
· rattle
· RColorBrewer
· corrgram
· ElemStatLearn
· car
In Summary
An idea of some of the types of classifiers available in ML.
What a confusion matrix and a ROC plot mean for a classifier and how to interpret them.
An idea of how to test a set of techniques and parameters to help you find the best model for your data.
Slides, Data, Scripts are all on GH: https://github.com/braz/DublinR-ML-treesandforests

Introduction to Machine Learning using R - Dublin R User Group - Oct 2013

  • 1. Introduction with Examples. Eoin Brazil - https://github.com/braz/DublinR-ML-treesandforests
  • 2. Machine Learning Techniques in R
      · A bit of context around ML
      · How can you interpret their results?
      · A few techniques to improve prediction / reduce over-fitting
      · Kaggle & similar competitions - using ML for fun & profit
      · Nuts & Bolts - 4 data sets and 6 techniques
      · A brief tour of some useful data handling / formatting tools
  • 4. A bit of context around ML
  • 7. Model Selection and Model Assessment
  • 8. Model Choice - Move from Adaptability to Simplicity
  • 10. Interpreting A Confusion Matrix Example
  • 11. Confusion Matrix - Calculations
  • 12. Interpreting A ROC Plot
      · A point in this plot is better than another if it lies to its northwest (higher TPR, lower FPR, or both).
      · "Conservatives", on the LHS and near the X-axis, only make a positive classification on strong evidence, so they make few FP errors but also have low TP rates.
      · "Liberals", in the upper RHS, make positive classifications on weak evidence, so they identify nearly all positives but have high FP rates.
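To make the "conservative" and "liberal" regions concrete, the sketch below computes the (FPR, TPR) point of one classifier at a strict and at a lenient threshold. It is plain Python rather than R so that it stays self-contained; the scores, labels and the `roc_point` helper are invented for illustration and do not come from the slides.

```python
def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when every score >= threshold is classified positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Strict threshold: few positives predicted, so few FP but also few TP.
conservative = roc_point(scores, labels, threshold=0.85)
# Lenient threshold: almost everything predicted positive.
liberal = roc_point(scores, labels, threshold=0.25)

print("conservative (FPR, TPR):", conservative)
print("liberal (FPR, TPR):", liberal)
```

With this toy data the strict threshold lands on the left edge of the plot at half the achievable TPR, while the lenient one finds every positive at the cost of a high FPR.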
  • 14. Addressing Prediction Error
      · K-fold Cross-Validation (e.g. 10-fold)
          - Allows for averaging the error across the models
      · Bootstrapping: draw B random samples with replacement from the data set to create B bootstrapped data sets, each the same size as the original. These are used as training sets, with the original used as the test set.
      · Other variations on the above:
          - Repeated cross-validation
          - The '.632' bootstrap
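The two resampling schemes on this slide can be sketched at the level of row indices. This is an illustrative Python sketch, not the deck's R code; `k_fold_indices` and `bootstrap_indices` are hypothetical helper names, and the data set size of 20 rows is arbitrary.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def bootstrap_indices(n, b, seed=0):
    """Draw b samples of size n, with replacement, from range(n)."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(b)]

n = 20  # hypothetical number of rows

# 10-fold CV: each fold is held out once, the other nine form the training set.
folds = k_fold_indices(n, k=10)
splits = []
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    splits.append((train, held_out))  # fit on train, score on held_out

# Bootstrap: B resampled training sets, each the same size as the original.
boot = bootstrap_indices(n, b=5)

print(len(splits), "CV splits;", len(boot), "bootstrap samples of size", len(boot[0]))
```

Averaging the ten held-out errors gives the cross-validated estimate; in the bootstrap variant described on the slide, each resampled set trains a model that is then scored against the original rows.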
  • 16. Kaggle - using ML for fun & profit
  • 17. Nuts & Bolts - Data sets and Techniques
  • 19. Aside - How does associative analysis work?
  • 20. What are they good for? Marketing Survey Data - Part 1
  • 21. Marketing Survey Data - Part 2
  • 22. Aside - How do decision trees work?
  • 23. What are they good for? Car Insurance Policy Exposure Management - Part 1
      · Analysing insurance claim details of 67856 policies taken out in 2004 and 2005.
      · The model maps each record into one of X mutually exclusive terminal nodes or groups.
      · These groups are represented by their average response, where the node number is treated as the data group.
      · The binary claim indicator is modelled with 6 variables, giving a probability estimate at each terminal node of whether an insurance policyholder will claim on their policy.
  • 24. Car Insurance Policy Exposure Management - Part 2
      · The root node splits the data set on 'agecat'.
      · Younger drivers go to the left (1-8) and older drivers (9-11) to the right.
      · N9 splits on the basis of vehicle value.
      · N10: <= $28.9k, giving 15k records and 5.4% of claims.
      · N11: > $28.9k, giving 1.9k records and 8.5% of claims.
      · Left split from the root: N2 splits on vehicle body type, then on age (N4), then on vehicle value (N6).
      · The n value = number of the overall population in the group, and the y value = probability of a claim from a driver in that group.
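The n and y values shown at each node can be reproduced with a few lines: a split sends each record left or right, and each resulting group is summarised by its size and its mean claim indicator. The sketch below is plain Python with invented records; only the $28.9k split value is taken from the slide, and `summarise_node` is a hypothetical helper.

```python
def summarise_node(records):
    """Return (n, y): record count and mean of the binary claim indicator."""
    n = len(records)
    y = sum(claim for _, claim in records) / n
    return n, y

# Hypothetical (vehicle value in $k, claimed 0/1) records, not the slide's data.
records = [(12.0, 0), (18.5, 0), (22.0, 1), (27.0, 0),
           (30.0, 1), (35.5, 0), (41.0, 1), (52.0, 0)]

threshold = 28.9  # split value quoted on the slide for node N9
left = [r for r in records if r[0] <= threshold]
right = [r for r in records if r[0] > threshold]

left_n, left_y = summarise_node(left)
right_n, right_y = summarise_node(right)

print("left node:  n =", left_n, " y =", left_y)
print("right node: n =", right_n, " y =", right_y)
```

Because y is just the average of a 0/1 indicator within the node, it can be read directly as the estimated claim probability for any policy that falls into that group.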
  • 25. What are they good for? Cancer Research Screening - Part 1
      · Hill et al (2007) model how well cells within an image are segmented: 61 vars with 2019 obs (Training = 1009 & Test = 1010).
      · "Impact of image segmentation on high-content screening data quality for SK-BR-3 cells", Andrew A Hill, Peter LaPan, Yizheng Li and Steve Haney, BMC Bioinformatics 2007, 8:340.
      · Panels:
          - b, Well-Segmented (WS)
          - c, WS (e.g. complete nucleus and cytoplasmic region)
          - d, Poorly-Segmented (PS)
          - e, PS (e.g. partial match/es)
  • 26. Cancer Research Screening Dataset - Part 2
  • 28. Cancer Research Screening - Part 4
  • 29. Cancer Research Screening Dataset - Part 5
  • 30. What are they good for? Predicting the Quality of Wine - Part 1
      · Cortez et al (2009) model the quality of wines (Vinho Verde): 14 vars with 6497 obs (Training = 5199 & Test = 1298).
      · "Modeling wine preferences by data mining from physicochemical properties", P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Decision Support Systems 2009, 47(4):547-553.
      · Classes:
          - Good (quality score is >= 6)
          - Bad (quality score is < 6)
      · Class counts in the test set:
        ##
        ##  Bad Good
        ##  476  822
  • 31. Predicting the Quality of Wine - Part 2
  • 32. Predicting the Quality of Wine - Part 3
  • 33. Predicting the Quality of Wine - Part 4 - Problems with Trees
      · Strengths:
          - Deals with irrelevant inputs
          - No data preprocessing required
          - Scalable computation (fast to build)
          - Tolerant of missing values (little loss of accuracy)
          - Only a few tunable parameters (easy to learn)
          - Allows for a human-understandable graphic representation
      · Problems:
          - Data fragmentation for high-dimensional sparse data sets (over-fitting)
          - Difficult to fit to a trend / piece-wise constant model
          - Highly influenced by changes to the data set and by local optima (deep trees might be questionable as the errors propagate down)
  • 34. Aside - How does a random forest work?
  • 35. Predicting the Quality of Wine - Part 5 - Random Forest
  • 36. Predicting the Quality of Wine - Part 6 - Other ML methods
      · K-nearest neighbors
          - Unsupervised learning / non-target based learning
          - Distance matrix / cluster analysis using Euclidean distances
      · Neural Nets
          - Looking at a basic feed-forward, simple 3-layer network (input, 'processing', output)
          - Each node / neuron is a set of numerical parameters / weights tuned by the learning algorithm used
      · Support Vector Machines
          - Supervised learning
          - Non-probabilistic binary linear classifier / nonlinear classifiers by applying the kernel trick
          - Constructs a hyper-plane/s in a high-dimensional space
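As an illustration of the Euclidean-distance idea behind k-nearest neighbors when it is used for a Good/Bad classification like the wine task, here is a minimal Python sketch. The two-feature points, their labels and the `knn_predict` helper are all made up; the deck's own R code is not reproduced here.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to `query`."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature points labelled Good / Bad.
train = [((1.0, 1.0), "Bad"), ((1.2, 0.8), "Bad"), ((0.9, 1.3), "Bad"),
         ((3.0, 3.2), "Good"), ((3.1, 2.9), "Good"), ((2.8, 3.0), "Good")]

near_bad = knn_predict(train, (1.1, 1.0))
near_good = knn_predict(train, (2.9, 3.1))
print(near_bad, near_good)
```

In practice the features would usually be scaled first, since Euclidean distance is dominated by whichever variable has the largest range.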
  • 37. Aside - How does k nearest neighbors work?
  • 38. Predicting the Quality of Wine - Part 7 - kNN
  • 39. Aside - How do neural networks work?
  • 40. Predicting the Quality of Wine - Part 8 - NNET
  • 41. Aside - How do support vector machines work?
  • 42. Predicting the Quality of Wine - Part 9 - SVM
  • 43. Predicting the Quality of Wine - Part 10 - All Results
  • 44. What are they not good for? Predicting Extramarital Affairs
      · Fair, R.C. (1978) models the possibility of affairs: 9 vars with 601 obs (Training = 481 & Test = 120).
      · "A Theory of Extramarital Affairs", Fair, R.C., Journal of Political Economy 1978, 86:45-61.
      · Classes:
          - Yes (affairs is >= 1 in last 6 months)
          - No (affairs is < 1 in last 6 months)
      · Class counts in the test set:
        ##
        ##  No Yes
        ##  90  30
  • 46. Predicting Extramarital Affairs - RF & NB
      · Random Forest
        ##           Reference
        ## Prediction No Yes
        ##        No  90  30
        ##        Yes  0   0
        ## Accuracy
        ##     0.75
      · Naive Bayes
        ##           Reference
        ## Prediction No Yes
        ##        No  88  29
        ##        Yes  2   1
        ## Accuracy
        ##     0.75
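The two confusion matrices on this slide show why accuracy alone can mislead. The Python sketch below recomputes accuracy, sensitivity and specificity from the counts as transcribed, treating "Yes" as the positive class (an assumption; the slide does not state which class is positive). `summarise` is a hypothetical helper, not a function from the deck.

```python
def summarise(tp, fp, fn, tn):
    """Accuracy, sensitivity (TPR) and specificity (TNR) from a 2x2 table."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts read from the slide's tables, with "Yes" assumed to be positive.
rf = summarise(tp=0, fp=0, fn=30, tn=90)  # Random Forest
nb = summarise(tp=1, fp=2, fn=29, tn=88)  # Naive Bayes

for name, stats in (("RF", rf), ("NB", nb)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The Random Forest reaches 0.75 only by predicting "No" for everyone, which is exactly the share of "No" cases in the test set (90 of 120), so its sensitivity is zero. Recomputing from the Naive Bayes counts gives 89/120, about 0.742, slightly below the 0.75 printed on the slide, and it finds just one of the 30 "Yes" cases.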
  • 47. Other related tools: OpenRefine (formerly Google Refine) / Rattle
  • 48. Other related tools: Command Line Utilities
      · http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
      · http://blog.comsysto.com/2013/04/25/data-analysis-with-the-unix-shell/
          - sed / awk
          - head / tail
          - wc (word count)
          - grep
          - sort / uniq
          - join
          - Gnuplot
      · http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
          - http://csvkit.readthedocs.org/en/latest/
          - https://github.com/jehiah/json2csv
          - http://stedolan.github.io/jq/
          - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/sample
          - https://github.com/bitly/data_hacks
          - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/Rio
          - https://github.com/parmentf/xml2json
  • 49. An (incomplete) tour of the packages in R
      · caret, party, rpart, rpart.plot, AppliedPredictiveModeling, randomForest, corrplot, arules, arulesViz
      · C50, pROC, kernlab, rattle, RColorBrewer, corrgram, ElemStatLearn, car
  • 50. In Summary
      · An idea of some of the types of classifiers available in ML.
      · What a confusion matrix and a ROC plot mean for a classifier, and how to interpret them.
      · An idea of how to test a set of techniques and parameters to help you find the best model for your data.
      · Slides, data and scripts are all on GitHub: https://github.com/braz/DublinR-ML-treesandforests