This document provides an overview of machine learning concepts and how to apply them using the R programming language. It introduces machine learning tasks like classification, regression, and clustering. For classification, it demonstrates linear regression on a movie genres dataset and k-nearest neighbors on a handwritten digits dataset. For regression, it fits linear models to a prestige dataset. For clustering, it applies k-means to group crimes in Chicago neighborhoods. Visualizations are shown throughout using R packages like ggplot and caret.
5. Movies Genres
Title # Kisses # Kicks Genre
Taken 3 47 Action
Love story 24 2 Romance
P.S. I love you 17 3 Romance
Rush hours 5 51 Action
Bad boys 7 42 Action
Question:
What is the genre of
Gone with the wind
?
6. Data-based classification
Id Feature 1 Feature 2 Class
1. 3 47 A
2. 24 2 B
3. 17 3 B
4. 5 51 A
5. 7 42 A
Question:
What is the class of the entry
with the following features:
F1: 31, F2: 4
?
12. Why R?
• Ross Ihaka & Robert Gentleman
• Successor of S
• Open source
• Community driven
• #1 for statistical computing
• Exploratory Data Analysis
• Machine Learning
• Visualisation
13. Supervised Machine Learning workflow
Clean data Data split
Machine Learning
algorithm
Trained model Score
Preprocess
data
Training
data
Test data
23. Prestige dataset
Feature Data type Description
education continuous Average education (years)
income integer Average income (dollars)
women continuous Percentage of women
prestige continuous Pineo-Porter prestige score for
occupation
census integer Canadian Census occupational
code
type multi-valued
discrete
Type of occupation: bc, prof, wc
24. # Pairs for the numeric data
pairs(Prestige[,-c(5,6)], pch=21, bg=Prestige$type)
25. # Linear regression, numerical data
num.model <- lm(Prestige ~ education + log2(income) + women, prestige)
summary(num.model)
--------------------------------------------------
Call:
lm(formula = prestige ~ education + log2(income) + women, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-17.364 -4.429 -0.101 4.316 19.179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -110.9658 14.8429 -7.476 3.27e-11 ***
education 3.7305 0.3544 10.527 < 2e-16 ***
log2(income) 9.3147 1.3265 7.022 2.90e-10 ***
women 0.0469 0.0299 1.568 0.12
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
26. Regression
Plots
• Residuals vs Fitter
• Spot non-linear patterns
• Normal Q-Q
• Check normal distribution
• Scale – Location
• If residuals are spread
equally along the ranges of
predictors
• Residuals vs Leverage
• Find influential cases if any.
27. Categorical data for regression
• Categories: A, B, C are coded as
dummy variables
• In general if the variable has k
categories it will be decoded into
k-1 dummy variables
Category V1 V2
A 0 0
B 1 0
C 0 0
𝑓 𝒙 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽𝑗 𝑥𝑗 + 𝛽𝑗+1 𝑣1 + ⋯ + 𝛽𝑗+𝑘−1 𝑣 𝑘
33. Chicago crimes dataset
Data column Data type
ID Number
Case Number String
Arrest Boolean
Primary Type Enum
District Enum
DateFBI Code Enum
Longitude Numeric
Latitude Numeric
...
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
34. # Read data
crimeData <- read.csv(crimeFilePath)
# Only data with location, only Assault or Burglary types
crimeData <- crimeData[
!is.na(crimeData$Latitude) & !is.na(crimeData$Longitude),]
selectedCrimes <- subset(crimeData,
Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))
# Visualise
library(ggplot2)
library(ggmap)
# Get map from Google
map_g <- get_map(location=c(lon=mean(crimeData$Longitude, na.rm=TRUE), lat=mean(
crimeData$Latitude, na.rm=TRUE)), zoom = 11, maptype = "terrain", scale = 2)
ggmap(map_g) + geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
fill = Primary.Type, alpha = 0.8), size = 1, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)