Clean, Learn and Visualise data with R
Barbara Fusinska
@BasiaFusinska
About me
Programmer
Machine Learning
Data Solutions Architect
@BasiaFusinska
https://github.com/BasiaFusinska/RMachineLearning
Agenda
• Machine Learning
• R platform
• Machine Learning with R
• Classification problem
• Linear Regression
• Clustering
Machine Learning?
Movies Genres
Title             # Kisses   # Kicks   Genre
Taken             3          47        Action
Love story        24         2         Romance
P.S. I love you   17         3         Romance
Rush hours        5          51        Action
Bad boys          7          42        Action
Question: What is the genre of Gone with the wind?
Data-based classification
Id   Feature 1   Feature 2   Class
1.   3           47          A
2.   24          2           B
3.   17          3           B
4.   5           51          A
5.   7           42          A
Question: What is the class of the entry with the following features: F1: 31, F2: 4?
Data Visualization
[Figure: scatter plot of the two features with a line separating class A from class B]
Rule 1: If on the left side of the line then Class = A
Rule 2: If on the right side of the line then Class = B
Chick sexing
Supervised learning
• Classification, regression
• Label, target value
• Training & Validation phases
Unsupervised learning
• Clustering, feature selection
• Finding structure of data
• Statistical values describing the data
R language
Why R?
• Ross Ihaka & Robert Gentleman
• Successor of S
• Open source
• Community driven
• #1 for statistical computing
• Exploratory Data Analysis
• Machine Learning
• Visualisation
Supervised Machine Learning workflow
Preprocess data → Clean data → Data split
Training data → Machine Learning algorithm → Trained model
Test data + Trained model → Score
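The data split step of the workflow can be sketched in base R. This is only an illustration: the built-in iris data, the seed, and the 70/30 ratio are assumptions, not part of the talk.

```r
# Illustrative only: split a data frame into training and test sets.
set.seed(42)                                   # assumed seed, for reproducibility
data(iris)                                     # assumed stand-in dataset
n <- nrow(iris)
trainIdx <- sample(n, size = round(0.7 * n))   # assumed 70/30 split
trainData <- iris[trainIdx, ]                  # rows used to fit the model
testData  <- iris[-trainIdx, ]                 # held-out rows used for scoring
```

The model is fitted on trainData only, and testData is kept aside until the scoring step.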
Classification problem
Model training
Data & Labels
Labels: digits 0 to 9
Data preparation
32 x 32 bitmap (values 0-1) reduced to 8 x 8 (values 0..16)
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
K-Nearest Neighbours Algorithm
• Object is classified by a majority vote of its k nearest neighbours
• k – algorithm parameter
• Distance metrics: Euclidean (continuous variables), Hamming (text)
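A minimal sketch of the majority vote, using the five-row toy table from the "Data-based classification" slide. The helper name knnVote and the choice k = 3 are assumptions made for illustration; the talk itself uses knn3 from caret.

```r
# Toy data from the "Data-based classification" slide
train <- data.frame(f1 = c(3, 24, 17, 5, 7),
                    f2 = c(47, 2, 3, 51, 42),
                    class = c("A", "B", "B", "A", "A"))

# Hypothetical helper: classify one point by majority vote of its
# k nearest neighbours under Euclidean distance.
knnVote <- function(train, point, k = 3) {
  d <- sqrt((train$f1 - point[1])^2 + (train$f2 - point[2])^2)
  nearest <- train$class[order(d)[1:k]]    # labels of the k closest rows
  names(which.max(table(nearest)))         # most frequent label wins
}

knnVote(train, c(31, 4))   # the query from the slide: F1 = 31, F2 = 4
```

With k = 3 the three closest rows are two of class B and one of class A, so the vote returns "B".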
Evaluation methods for classification
Confusion matrix
                      Reference Positive   Reference Negative
Prediction Positive   TP                   FP
Prediction Negative   FN                   TN
Receiver Operating Characteristic (ROC) curve
Area under the curve (AUC)
$$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
How good it is at detecting positives
$$\text{Specificity} = \frac{TN}{TN + FP}$$
How good it is at avoiding false alarms
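As a rough illustration, the four metrics can be computed directly from the cells of a binary confusion matrix. The helper name classMetrics and the counts below are made up for this sketch; specificity is taken as TN / (TN + FP), the share of actual negatives that are correctly rejected.

```r
# Hypothetical helper: metrics from the four cells of a 2 x 2 confusion matrix.
classMetrics <- function(tp, fp, fn, tn) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),    # sensitivity
    specificity = tn / (tn + fp))
}

# Made-up counts, for illustration only
m <- classMetrics(tp = 40, fp = 10, fn = 5, tn = 45)
m
```

For multi-class problems such as the digits data, confusionMatrix from caret reports these statistics per class.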
# Read data
trainingSet <- read.csv(trainingFile, header = FALSE)
testSet <- read.csv(testFile, header = FALSE)
trainingSet$V65 <- factor(trainingSet$V65)
testSet$V65 <- factor(testSet$V65)
# Classify
library(caret)
knn.fit <- knn3(V65 ~ ., data=trainingSet, k=5)
# Predict new values
pred.test <- predict(knn.fit, testSet[,1:64], type="class")
# Confusion matrix
library(caret)
confusionMatrix(pred.test, testSet[,65])
Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
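Gradient descent for fitting the coefficients can be sketched as below. The synthetic data, the learning rate, and the iteration count are all assumptions chosen so that the loop converges; this is not code from the talk.

```r
# Synthetic data (assumed for illustration): y = 2 + 3x + noise
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.1)
X <- cbind(1, x)                 # design matrix with an intercept column

beta <- c(0, 0)                  # starting point
rate <- 0.5                      # assumed learning rate
for (i in 1:5000) {
  # gradient of the mean squared error with respect to beta
  grad <- -2 * t(X) %*% (y - X %*% beta) / length(y)
  beta <- beta - rate * as.vector(grad)
}
beta                             # close to the analytical solution from lm(y ~ x)
```

On a problem this small the analytical solution is cheaper; the iterative approach matters when the data are too large for a direct solve.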
Ordinary linear regression
Residual sum of squares (RSS)
$$S(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$$
$$\hat{\beta} = \arg\min_{\beta} S(\beta)$$
$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
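Minimising the residual sum of squares has a closed-form solution given by the normal equations, (X'X) beta = X'y. A small sketch, assuming the built-in mtcars data as a stand-in (it is not used in the talk):

```r
# Closed-form least squares on a built-in dataset (mtcars is an assumption here)
X <- cbind(1, mtcars$wt, mtcars$hp)        # intercept, weight, horsepower
y <- mtcars$mpg
betaHat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y

# The same coefficients from lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(normalEquations = as.vector(betaHat), lm = coef(fit))
```

lm() uses a QR decomposition internally rather than forming X'X, which is numerically safer, but both routes minimise the same S(beta).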
Evaluation methods for regression
• Errors
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (f_i - y_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2}$$
• Statistics (t, ANOVA)
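Both error measures are short enough to write by hand. In this sketch the function names rmse and rSquared and the mtcars model are assumptions for illustration; f holds the fitted values and y the observed ones.

```r
# Hypothetical helpers following the two formulas
rmse <- function(f, y) sqrt(mean((f - y)^2))
rSquared <- function(f, y) 1 - sum((f - y)^2) / sum((mean(y) - y)^2)

fit <- lm(mpg ~ wt, data = mtcars)   # assumed example model
f <- fitted(fit)
rmse(f, mtcars$mpg)
rSquared(f, mtcars$mpg)              # agrees with summary(fit)$r.squared
```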
Prestige dataset
Feature     Data type               Description
education   continuous              Average education (years)
income      integer                 Average income (dollars)
women       continuous              Percentage of women
prestige    continuous              Pineo-Porter prestige score for occupation
census      integer                 Canadian Census occupational code
type        multi-valued discrete   Type of occupation: bc, prof, wc
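# Load the data (assumption: the Prestige data frame shipped with the carData package)
library(carData)
prestige <- Prestige
# Pairs for the numeric data (drop census and type)
pairs(prestige[,-c(5,6)], pch=21, bg=prestige$type)
# Linear regression, numerical data
num.model <- lm(prestige ~ education + log2(income) + women, data = prestige)
summary(num.model)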
--------------------------------------------------
Call:
lm(formula = prestige ~ education + log2(income) + women, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-17.364 -4.429 -0.101 4.316 19.179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -110.9658 14.8429 -7.476 3.27e-11 ***
education 3.7305 0.3544 10.527 < 2e-16 ***
log2(income) 9.3147 1.3265 7.022 2.90e-10 ***
women 0.0469 0.0299 1.568 0.12
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
Regression Plots
• Residuals vs Fitted
  • Spot non-linear patterns
• Normal Q-Q
  • Check whether residuals are normally distributed
• Scale-Location
  • Check whether residuals are spread equally along the range of predictors
• Residuals vs Leverage
  • Find influential cases, if any
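In base R, calling plot() on a fitted lm object draws these four diagnostics by default. A minimal sketch, assuming a model on the built-in mtcars data rather than the one from the talk:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # assumed example model
par(mfrow = c(2, 2))                      # arrange the plots in a 2 x 2 grid
plot(fit)                                 # Residuals vs Fitted, Normal Q-Q,
                                          # Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))                      # restore the single-plot layout
```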
Categorical data for regression
• Categories: A, B, C are coded as dummy variables
• In general, if the variable has k categories it will be encoded into k-1 dummy variables
Category   V1   V2
A          0    0
B          1    0
C          0    1
$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$$
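R builds the dummy variables automatically when a factor appears in a formula. The encoding can be inspected with model.matrix; the small factor below is made up for illustration.

```r
# A three-level factor (illustrative values)
category <- factor(c("A", "B", "C", "A"))

# Treatment contrasts: the first level (A) is the baseline, so k = 3
# categories give k - 1 = 2 dummy columns next to the intercept.
dummies <- model.matrix(~ category)
dummies
```

The columns categoryB and categoryC play the role of V1 and V2: both are 0 for A, and exactly one of them is 1 for B or C.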
# Linear regression, categorical variable
cat.model <- lm(prestige ~ education + log2(income) + type, prestige)
summary(cat.model)
--------------------------------------------------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.2019 13.7431 -5.909 5.63e-08 ***
education 3.2845 0.6081 5.401 5.06e-07 ***
log2(income) 7.2694 1.1900 6.109 2.31e-08 ***
typeprof 6.7509 3.6185 1.866 0.0652 .
typewc -1.4394 2.3780 -0.605 0.5465
# Linear regression, categorical variable split
et.fit <- lm(prestige ~ type*education, prestige)
summary(et.fit)
--------------------------------------------------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.2936 8.6470 -0.497 0.621
typeprof 18.8637 16.8881 1.117 0.267
typewc -24.3833 21.7777 -1.120 0.266
education 4.7637 1.0247 4.649 1.11e-05 ***
typeprof:education -0.9808 1.4495 -0.677 0.500
typewc:education 1.6709 2.0777 0.804 0.423
# Plot the fitted regression line for each occupation type
library(ggplot2)
cf <- et.fit$coefficients
ggplot(prestige, aes(education, prestige)) + geom_point(aes(col=type)) +
geom_abline(slope=cf[4], intercept = cf[1], colour='red') +
geom_abline(slope=cf[4] + cf[5], intercept = cf[1] + cf[2], colour='green') +
geom_abline(slope=cf[4] + cf[6], intercept = cf[1] + cf[3], colour='blue')
Clustering problem
K-means Algorithm
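K-means alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points, until the assignments stop changing. A minimal sketch with the built-in kmeans function; the two synthetic blobs, the seed, and nstart = 10 are assumptions for illustration.

```r
set.seed(7)
# Two well-separated synthetic blobs of 50 points each (assumed data)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
result <- kmeans(pts, centers = 2, nstart = 10)
result$centers          # one row per cluster centre
table(result$cluster)   # cluster sizes
```

The result depends on the random starting centres, which is why several restarts are commonly used.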
Chicago crimes dataset
Data column    Data type
ID             Number
Case Number    String
Arrest         Boolean
Primary Type   Enum
District       Enum
Date
FBI Code       Enum
Longitude      Numeric
Latitude       Numeric
...
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
# Read data
crimeData <- read.csv(crimeFilePath)
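# Only data with location, only Assault or Burglary types
crimeData <- crimeData[
  !is.na(crimeData$Latitude) & !is.na(crimeData$Longitude),]
# Assumption: crimeTypes (not defined on the slide) holds the sorted crime type names
crimeTypes <- sort(unique(as.character(crimeData$Primary.Type)))
selectedCrimes <- subset(crimeData,
  Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))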
# Visualise
library(ggplot2)
library(ggmap)
# Get map from Google
map_g <- get_map(location=c(lon=mean(crimeData$Longitude, na.rm=TRUE), lat=mean(
crimeData$Latitude, na.rm=TRUE)), zoom = 11, maptype = "terrain", scale = 2)
ggmap(map_g) + geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
fill = Primary.Type, alpha = 0.8), size = 1, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)
Assault
& Burglary
# k-means clustering (k=6)
clusterResult <- kmeans(selectedCrimes[, c('Longitude', 'Latitude')], 6)
# Get the clusters information
centers <- as.data.frame(clusterResult$centers)
clusterColours <- factor(clusterResult$cluster)
# Visualise
ggmap(map_g) +
geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
alpha = 0.8, color = clusterColours), size = 1) +
geom_point(data = centers, aes(x = Longitude, y = Latitude,
alpha = 0.8), size = 1.5) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)
Crimes
clusters
Keep in touch
BarbaraFusinska.com
@BasiaFusinska
https://github.com/BasiaFusinska/RMachineLearning

 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 

Clean, Learn and Visualise data with R

  • 1. Clean, Learn and Visualise data with R Barbara Fusinska @BasiaFusinska
  • 2. About me: Programmer, Machine Learning, Data Solutions Architect. @BasiaFusinska, https://github.com/BasiaFusinska/RMachineLearning
  • 3. Agenda
      • Machine Learning
      • R platform
      • Machine Learning with R
      • Classification problem
      • Linear Regression
      • Clustering
  • 5. Movies Genres
      Title            # Kisses  # Kicks  Genre
      Taken            3         47       Action
      Love story       24        2        Romance
      P.S. I love you  17        3        Romance
      Rush hours       5         51       Action
      Bad boys         7         42       Action
      Question: What is the genre of Gone with the wind?
  • 6. Data-based classification
      Id  Feature 1  Feature 2  Class
      1.  3          47         A
      2.  24         2          B
      3.  17         3          B
      4.  5          51         A
      5.  7          42         A
      Question: What is the class of the entry with the following features: F1: 31, F2: 4?
  • 7. Data Visualization [Figure: scatter plot with a dividing line; regions labelled A and B]
      Rule 1: If on the left side of the line then Class = A
      Rule 2: If on the right side of the line then Class = B
  • 9. Supervised learning
      • Classification, regression
      • Label, target value
      • Training & Validation phases
  • 10. Unsupervised learning
      • Clustering, feature selection
      • Finding structure of data
      • Statistical values describing the data
  • 12. Why R?
      • Ross Ihaka & Robert Gentleman
      • Successor of S
      • Open source
      • Community driven
      • #1 for statistical computing
      • Exploratory Data Analysis
      • Machine Learning
      • Visualisation
  • 13. Supervised Machine Learning workflow [Diagram labels: Preprocess data, Clean data, Data split, Training data, Test data, Machine Learning algorithm, Trained model, Score]
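The "Data split" step of the workflow can be sketched in base R as follows. This is a minimal illustration only: the toy `dataset`, the column names and the 70/30 ratio are assumptions, not taken from the slides.

```r
# Hypothetical toy data frame standing in for a cleaned dataset.
set.seed(42)
dataset <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  label = factor(sample(c("A", "B"), 100, replace = TRUE))
)

# Randomly pick 70% of the row indices for training (the ratio is an assumption).
trainIdx <- sample(nrow(dataset), size = floor(0.7 * nrow(dataset)))

# Training data feeds the learning algorithm; test data is kept aside for scoring.
trainingData <- dataset[trainIdx, ]
testData <- dataset[-trainIdx, ]
```

Keeping the test rows out of training is what makes the final score an estimate of performance on unseen data.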
  • 14. Classification problem: model training on data & labels (digits 0 1 2 3 4 5 6 7 8 9)
  • 15. Data preparation: 32 x 32 bitmap (pixel values 0-1) reduced to an 8 x 8 matrix (values 0..16). https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
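The reduction from 32 x 32 to 8 x 8 can be illustrated with a short base R sketch. It assumes the scheme described for the UCI dataset, where each non-overlapping 4 x 4 block is replaced by its count of "on" pixels; the random `bitmap` and the helper `blockOf` are hypothetical stand-ins.

```r
# Hypothetical random 32 x 32 binary bitmap standing in for a scanned digit.
set.seed(3)
bitmap <- matrix(rbinom(32 * 32, size = 1, prob = 0.3), nrow = 32, ncol = 32)

# Map a pixel index (1..32) to its 4-pixel-wide block index (1..8).
blockOf <- function(index) (index - 1) %/% 4 + 1

# Sum the pixels in every 4 x 4 block; each result lies in 0..16.
features <- tapply(bitmap,
                   list(blockOf(row(bitmap)), blockOf(col(bitmap))),
                   sum)
```

Flattening `features` gives the 64 input values per digit, which matches the 64 feature columns used in the classification code.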
  • 16. K-Nearest Neighbours Algorithm
      • Object is classified by a majority vote
      • k: algorithm parameter
      • Distance metrics: Euclidean (continuous variables), Hamming (text)
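As a rough sketch of the majority vote, the five-row table from the "Data-based classification" slide can be classified by hand in base R. The function name `knnClassify` and the choice k = 3 are assumptions made for illustration.

```r
# Training entries from the "Data-based classification" slide.
train <- data.frame(
  f1 = c(3, 24, 17, 5, 7),
  f2 = c(47, 2, 3, 51, 42),
  class = c("A", "B", "B", "A", "A")
)

# Hypothetical helper: Euclidean distance to every training row,
# then a majority vote among the k closest rows.
knnClassify <- function(train, query, k = 3) {
  distances <- sqrt((train$f1 - query[1])^2 + (train$f2 - query[2])^2)
  nearest <- train$class[order(distances)[seq_len(k)]]
  names(which.max(table(nearest)))
}

# The question from the slide: F1 = 31, F2 = 4.
predicted <- knnClassify(train, c(31, 4), k = 3)
print(predicted)  # "B": the two closest rows are class B, the third is class A
```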
  • 17. Evaluation methods for classification
      Confusion matrix:
                              Reference
                              Positive  Negative
        Prediction  Positive  TP        FP
                    Negative  FN        TN
      Receiver Operating Characteristic curve, area under the curve (AUC)
      $\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$
      $\text{Precision} = \frac{TP}{TP + FP}$
      $\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$ (how good it is at detecting positives)
      $\text{Specificity} = \frac{TN}{TN + FP}$ (how good it is at avoiding false alarms)
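The four measures can be computed directly from a confusion matrix built with `table()`. The sketch below uses made-up `reference` and `prediction` vectors purely for illustration.

```r
# Hypothetical labels: 4 true positives-class cases followed by 6 negatives.
reference <- factor(c("pos", "pos", "pos", "pos",
                      "neg", "neg", "neg", "neg", "neg", "neg"),
                    levels = c("pos", "neg"))
prediction <- factor(c("pos", "pos", "pos", "neg",
                       "pos", "pos", "neg", "neg", "neg", "neg"),
                     levels = c("pos", "neg"))

# Rows are predictions, columns are the reference, as on the slide.
cm <- table(Prediction = prediction, Reference = reference)
TP <- cm["pos", "pos"]; FP <- cm["pos", "neg"]
FN <- cm["neg", "pos"]; TN <- cm["neg", "neg"]

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # sensitivity
specificity <- TN / (TN + FP)
```

With these vectors TP = 3, FP = 2, FN = 1 and TN = 4, so accuracy is 0.7, precision 0.6, recall 0.75 and specificity 4/6.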
  • 18. # Read data
        trainingSet <- read.csv(trainingFile, header = FALSE)
        testSet <- read.csv(testFile, header = FALSE)
        trainingSet$V65 <- factor(trainingSet$V65)
        testSet$V65 <- factor(testSet$V65)
        # Classify
        library(caret)
        knn.fit <- knn3(V65 ~ ., data=trainingSet, k=5)
        # Predict new values
        pred.test <- predict(knn.fit, testSet[,1:64], type="class")
  • 20. Regression problem
      • Dependent value
      • Predicting the real value
      • Fitting the coefficients
      • Analytical solutions
      • Gradient descent
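Gradient descent as a way of fitting the coefficients can be sketched as below. The synthetic data, the learning rate of 0.1 and the 1000 iterations are all assumptions chosen so that this small example converges; it is not the deck's own code.

```r
# Synthetic data: one standardised predictor and a noisy linear response.
set.seed(1)
x <- as.numeric(scale(rnorm(200)))
y <- 2 + 3 * x + rnorm(200, sd = 0.5)

X <- cbind(1, x)        # design matrix with an intercept column
beta <- c(0, 0)         # starting coefficients
learningRate <- 0.1     # assumed step size

for (i in 1:1000) {
  # Gradient of the halved mean squared error with respect to beta.
  gradient <- -t(X) %*% (y - X %*% beta) / length(y)
  beta <- beta - learningRate * gradient
}

# The analytical least-squares fit for comparison.
analytical <- coef(lm(y ~ x))
```

On this data the iterated `beta` agrees with `analytical` to several decimal places, which is the expected behaviour when the predictor is standardised and the step size is small enough.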
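  • 21. Ordinary linear regression
      Model: $f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
      Residual sum of squares (RSS): $S(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$
      Estimate: $\hat{\beta} = \arg\min_{\beta} S(\beta)$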
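Minimising the RSS has a closed-form solution through the normal equations, X^T X β = X^T y. The sketch below solves them on R's built-in `mtcars` data (an assumption; the deck uses a different dataset) and compares the result with `lm()`.

```r
# Design matrix with an intercept column and two predictors from mtcars.
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y for the minimiser of S(beta).
betaHat <- solve(t(X) %*% X, t(X) %*% y)

# Residual sum of squares at the solution.
rss <- sum((y - X %*% betaHat)^2)

# lm() fits the same model (internally via a QR decomposition).
fit <- lm(mpg ~ wt + hp, data = mtcars)
```

`betaHat` and `coef(fit)` should match up to floating-point error; `lm()` is preferable in practice because the QR approach is numerically more stable than forming X^T X.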
  • 22. Evaluation methods for regression
      • Errors: $RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$, $R^2 = 1 - \frac{\sum_{i=1}^{n} (f_i - y_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2}$
      • Statistics (t, ANOVA)
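Both error measures follow directly from the fitted values f_i and the observations y_i. A minimal sketch, again assuming the built-in `mtcars` data rather than the deck's dataset:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
f <- fitted(fit)     # predictions f_i on the training data
y <- mtcars$mpg      # observed values y_i

# Root mean squared error.
rmse <- sqrt(mean((f - y)^2))

# Coefficient of determination: 1 minus residual over total sum of squares.
r2 <- 1 - sum((f - y)^2) / sum((mean(y) - y)^2)
```

The hand-computed `r2` should equal `summary(fit)$r.squared`, the "Multiple R-squared" reported by `summary()`.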
  • 23. Prestige dataset
      Feature    Data type              Description
      education  continuous             Average education (years)
      income     integer                Average income (dollars)
      women      continuous             Percentage of women
      prestige   continuous             Pineo-Porter prestige score for occupation
      census     integer                Canadian Census occupational code
      type       multi-valued discrete  Type of occupation: bc, prof, wc
  • 24. # Pairs for the numeric data
        pairs(Prestige[,-c(5,6)], pch=21, bg=Prestige$type)
  • 25. # Linear regression, numerical data
        num.model <- lm(prestige ~ education + log2(income) + women, prestige)
        summary(num.model)
        --------------------------------------------------
        Call:
        lm(formula = prestige ~ education + log2(income) + women, data = prestige)
        Residuals:
            Min      1Q  Median      3Q     Max
        -17.364  -4.429  -0.101   4.316  19.179
        Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
        (Intercept)   -110.9658    14.8429  -7.476 3.27e-11 ***
        education        3.7305     0.3544  10.527  < 2e-16 ***
        log2(income)     9.3147     1.3265   7.022 2.90e-10 ***
        women            0.0469     0.0299   1.568     0.12
        ---
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
        Residual standard error: 7.093 on 98 degrees of freedom
        Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
        F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
  • 26. Regression Plots
      • Residuals vs Fitted: spot non-linear patterns
      • Normal Q-Q: check whether the residuals are normally distributed
      • Scale-Location: check whether residuals are spread equally along the ranges of predictors
      • Residuals vs Leverage: find influential cases, if any
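In base R these four diagnostics are what `plot()` draws by default for an `lm` object. A minimal sketch, assuming the built-in `mtcars` data in place of the deck's dataset:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
oldPar <- par(mfrow = c(2, 2))
plot(fit)
par(oldPar)  # restore the previous plotting layout
```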
  • 27. Categorical data for regression
      • Categories A, B, C are coded as dummy variables
      • In general, a variable with k categories is encoded into k-1 dummy variables
      Category  V1  V2
      A         0   0
      B         1   0
      C         0   1
      $f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$
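R performs this encoding automatically for factors in a model formula. The sketch below uses `model.matrix()` on a made-up factor to show the k-1 dummy columns; the variable name `category` is an assumption.

```r
# Hypothetical categorical variable with three levels.
category <- factor(c("A", "B", "C", "A", "C"))

# Default treatment contrasts: the first level (A) is the baseline,
# so only B and C get their own 0/1 columns.
dummies <- model.matrix(~ category)
print(dummies)
```

The resulting columns are `(Intercept)`, `categoryB` and `categoryC`; a row for category A has zeros in both dummy columns.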
  • 28. # Linear regression, categorical variable
        cat.model <- lm(prestige ~ education + log2(income) + type, prestige)
        summary(cat.model)
        --------------------------------------------------
        Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
        (Intercept)   -81.2019    13.7431  -5.909 5.63e-08 ***
        education       3.2845     0.6081   5.401 5.06e-07 ***
        log2(income)    7.2694     1.1900   6.109 2.31e-08 ***
        typeprof        6.7509     3.6185   1.866   0.0652 .
        typewc         -1.4394     2.3780  -0.605   0.5465
  • 29. # Linear regression, categorical variable split
        et.fit <- lm(prestige ~ type*education, prestige)
        summary(et.fit)
        --------------------------------------------------
        Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
        (Intercept)          -4.2936     8.6470  -0.497    0.621
        typeprof             18.8637    16.8881   1.117    0.267
        typewc              -24.3833    21.7777  -1.120    0.266
        education             4.7637     1.0247   4.649 1.11e-05 ***
        typeprof:education   -0.9808     1.4495  -0.677    0.500
        typewc:education      1.6709     2.0777   0.804    0.423
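With the `type*education` interaction, each occupation type gets its own regression line: the baseline type (bc) uses the plain intercept and education slope, and the other types add their own offsets. The sketch below works this out from the coefficient estimates printed on the slide; the data frame name `typeLines` is an assumption.

```r
# Coefficient estimates copied from the summary output on the slide.
cf <- c("(Intercept)" = -4.2936, "typeprof" = 18.8637, "typewc" = -24.3833,
        "education" = 4.7637, "typeprof:education" = -0.9808,
        "typewc:education" = 1.6709)

# Per-type line: intercept = base + type offset, slope = base + interaction.
typeLines <- data.frame(
  type = c("bc", "prof", "wc"),
  intercept = c(cf[["(Intercept)"]],
                cf[["(Intercept)"]] + cf[["typeprof"]],
                cf[["(Intercept)"]] + cf[["typewc"]]),
  slope = c(cf[["education"]],
            cf[["education"]] + cf[["typeprof:education"]],
            cf[["education"]] + cf[["typewc:education"]])
)
print(typeLines)
```

This gives prestige ≈ -4.29 + 4.76·education for bc, 14.57 + 3.78·education for prof, and -28.68 + 6.43·education for wc.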
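  • 30. # Plot prestige against education with one fitted line per occupation type
        library(ggplot2)
        cf <- et.fit$coefficients
        ggplot(prestige, aes(education, prestige)) +
          geom_point(aes(col=type)) +
          # bc (baseline): base intercept and slope
          geom_abline(slope=cf[4], intercept = cf[1], colour='red') +
          # prof: add the typeprof offset and the typeprof:education interaction
          geom_abline(slope=cf[4] + cf[5], intercept = cf[1] + cf[2], colour='green') +
          # wc: add the typewc offset and the typewc:education interaction
          geom_abline(slope=cf[4] + cf[6], intercept = cf[1] + cf[3], colour='blue')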
  • 33. Chicago crimes dataset
      Data column   Data type
      ID            Number
      Case Number   String
      Arrest        Boolean
      Primary Type  Enum
      District      Enum
      Date
      FBI Code      Enum
      Longitude     Numeric
      Latitude      Numeric
      ...
      https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  • 34. # Read data
        crimeData <- read.csv(crimeFilePath)
        # Only data with location, only Assault or Burglary types
        crimeData <- crimeData[!is.na(crimeData$Latitude) & !is.na(crimeData$Longitude), ]
        selectedCrimes <- subset(crimeData, Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))
        # Visualise
        library(ggplot2)
        library(ggmap)
        # Get map from Google
        map_g <- get_map(location = c(lon = mean(crimeData$Longitude, na.rm = TRUE),
                                      lat = mean(crimeData$Latitude, na.rm = TRUE)),
                         zoom = 11, maptype = "terrain", scale = 2)
        ggmap(map_g) +
          geom_point(data = selectedCrimes,
                     aes(x = Longitude, y = Latitude, fill = Primary.Type, alpha = 0.8),
                     size = 1, shape = 21) +
          guides(fill = FALSE, alpha = FALSE, size = FALSE)
  • 36. # k-means clustering (k=6)
        clusterResult <- kmeans(selectedCrimes[, c('Longitude', 'Latitude')], 6)
        # Get the clusters information
        centers <- as.data.frame(clusterResult$centers)
        clusterColours <- factor(clusterResult$cluster)
        # Visualise
        ggmap(map_g) +
          geom_point(data = selectedCrimes,
                     aes(x = Longitude, y = Latitude, alpha = 0.8, color = clusterColours),
                     size = 1) +
          geom_point(data = centers,
                     aes(x = Longitude, y = Latitude, alpha = 0.8),
                     size = 1.5) +
          guides(fill = FALSE, alpha = FALSE, size = FALSE)
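The structure returned by `kmeans()` can be explored without the crime data or a map. The sketch below is an illustration on synthetic coordinates; the three made-up hot spots, the name `locations`, `k = 3` and `nstart = 10` are all assumptions.

```r
# Synthetic locations scattered around three hypothetical hot spots.
set.seed(7)
locations <- rbind(
  data.frame(Longitude = rnorm(50, -87.70, 0.01), Latitude = rnorm(50, 41.80, 0.01)),
  data.frame(Longitude = rnorm(50, -87.60, 0.01), Latitude = rnorm(50, 41.90, 0.01)),
  data.frame(Longitude = rnorm(50, -87.65, 0.01), Latitude = rnorm(50, 41.70, 0.01))
)

# nstart = 10 reruns the algorithm from several random starts and keeps the best.
result <- kmeans(locations, centers = 3, nstart = 10)

centers <- as.data.frame(result$centers)  # one row per cluster centre
assignment <- factor(result$cluster)      # cluster label for every point
```

Because k-means starts from random centres, fixing the seed and using several starts makes the result reproducible and less likely to settle in a poor local optimum.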