Predicting California’s Ecoregions

PREDICTING CALIFORNIA’S ECOREGIONS: A COMPARISON OF SVM AND DECISION TREE APPROACHES
Douglas Callaway
CP5605: Advanced Data Mining and Knowledge Discovery
28 October 2015

MOTIVATION
Explore popular classification algorithms
Gain experience in R programming language
Apply skills to an interesting case study

PURPOSE
Develop 2 models to predict Californian ecoregions
Optimise each model’s parameters
Develop an unbiased sampling technique
Assess & compare each model’s performance using cross-validation

APPROACHES & JUSTIFICATION
Decision Tree
Numeric or nominal predictions
One or more categories
Intuitive rules
Unlimited models possible
Support Vector Machine
Numeric or nominal predictions
One or more categories
Abstract hyperplane class separation
One optimal model per training set
http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_
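
The contrast between "intuitive rules" and "hyperplane class separation" can be sketched in a few lines. This is an illustration only: the talk's models were built in R, Python is used here for brevity, and every threshold, weight, and class label below is a made-up value rather than a fitted parameter.

```python
# Illustrative only: thresholds, weights, and labels are hypothetical.

def tree_predict(elevation_m, annual_precip_mm):
    """A decision tree reads as nested, human-readable rules."""
    if elevation_m > 1500:
        return "mountain"
    if annual_precip_mm < 250:
        return "desert"
    return "coastal"


def linear_svm_predict(features, weights, bias):
    """A linear SVM labels a point by the side of a hyperplane it falls on."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


print(tree_predict(2000, 900))                            # mountain
print(linear_svm_predict([2.0, 1.0], [0.5, -1.0], -0.5))  # -1
```

The tree's path to a prediction can be read off directly, whereas the SVM's decision is a weighted sum that is harder to interpret.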

CASE STUDY: PREDICTING CALIFORNIA’S ECOREGIONS
12 distinct regions
Determined by EPA in 2013
“Denote areas of general similarity in ecosystems and in the type, quality, and quantity of environmental resources.” (U.S. Environmental Protection Agency, 2013)

CALIFORNIA WEATHER DATASET
30 years of weather data (1983-2012)
400 weather stations
Some issues with missing data

PRE-PROCESSING
Data consolidation
Cleaning
Aggregation
Feature selection
474,294 original records

HANDLING MISSING VALUES: SPATIAL INTERPOLATION
Nulls estimated from values of nearby measurements
Closer neighbors given higher weight
Inverse Distance Weighting (IDW)
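
A minimal sketch of IDW as described above, written in Python for illustration (the talk used R). The power of 2, the planar coordinates, and the station values are assumptions; the slides do not state which exponent or distance measure was used.

```python
import math


def idw_estimate(target, neighbors, power=2.0):
    """Estimate a missing value at target (x, y) from (x, y, value) neighbors.

    power=2.0 is an assumed exponent, not a value taken from the slides.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in neighbors:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # a co-located measurement is used directly
        weight = 1.0 / distance ** power  # closer neighbors weigh more
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("need at least one neighbor")
    return weighted_sum / weight_total


# Hypothetical temperatures at three stations 1, 2, and 3 units away.
stations = [(0.0, 1.0, 10.0), (0.0, 2.0, 20.0), (3.0, 0.0, 30.0)]
print(round(idw_estimate((0.0, 0.0), stations), 2))
```

With these values the nearest station dominates, so the estimate lands much closer to 10 than to the plain mean of 20.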

FEATURE SELECTION
Provided Attributes
Location (latitude, longitude, elevation)
Monthly temperature
  - Extreme min/max
  - Monthly average
Monthly precipitation
Calculated Attributes
Distance to coast
30-year temperature
  - Extreme min/max
Average annual precipitation
55 attributes in total!
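
A sketch of how two of the calculated attributes might be derived, and of one breakdown that reaches 55. The record layout, field names, and the per-group counts are assumptions made for illustration; the slide gives only the total. Python is used here, not the R of the original analysis.

```python
# Hypothetical record layout: one dict per station-month.
def summarize_station(monthly_records):
    """Collapse monthly records into 30-year extremes and mean annual precipitation."""
    years = {r["year"] for r in monthly_records}
    return {
        "extreme_min_temp": min(r["min_temp"] for r in monthly_records),
        "extreme_max_temp": max(r["max_temp"] for r in monthly_records),
        "avg_annual_precip": sum(r["precip"] for r in monthly_records) / len(years),
    }


# One possible tally that reaches 55 (an assumed breakdown, not stated on the slide).
attribute_counts = {
    "location": 3,                # latitude, longitude, elevation
    "monthly_temperature": 36,    # 12 months x (extreme min, extreme max, average)
    "monthly_precipitation": 12,  # 12 months
    "distance_to_coast": 1,
    "30_year_temperature": 2,     # extreme min, extreme max
    "avg_annual_precipitation": 1,
}
print(sum(attribute_counts.values()))  # 55
```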

CHOOSING APPROPRIATE TEST SET SIZES
[Figure: SVM Accuracy vs. Cross Validation Test Set Size. x-axis: Test Set Size (%); y-axis: Accuracy (%). Annotations: "Overfit", "Underfit", and a marker labelled "18".]
[Figure: Decision Tree Accuracy vs. Cross Validation Test Set Size. x-axis: Test Set Size (%); y-axis: Accuracy (%). Annotations: "Underfit", "Overfit", and a marker labelled "18".]

UNBIASED SAMPLING: CHOOSING DISPERSED TEST LOCATIONS
Weather stations spatially clustered around populated areas
n random locations generated within study area
Weather stations nearest random points chosen for test set
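
The sampling steps above could look roughly like the following Python sketch. It makes several simplifying assumptions that the slides do not confirm: the study area is approximated by a bounding box, distances are planar, and each station can be selected only once. The function name and station layout are hypothetical.

```python
import math
import random


def pick_dispersed_test_set(stations, n, bounds, seed=0):
    """Choose up to n test stations: the nearest unused station to each random point.

    stations: dict of station id -> (x, y)
    bounds: (min_x, min_y, max_x, max_y), a stand-in for the study area
    """
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bounds
    remaining = dict(stations)
    chosen = []
    for _ in range(min(n, len(stations))):
        px = rng.uniform(min_x, max_x)
        py = rng.uniform(min_y, max_y)
        nearest = min(
            remaining,
            key=lambda sid: math.hypot(remaining[sid][0] - px, remaining[sid][1] - py),
        )
        chosen.append(nearest)
        del remaining[nearest]  # each station is selected at most once
    return chosen


# Hypothetical layout: A and B are clustered, C and D are isolated.
demo_stations = {"A": (0.0, 0.0), "B": (0.1, 0.1), "C": (5.0, 5.0), "D": (9.0, 1.0)}
print(pick_dispersed_test_set(demo_stations, 2, (0.0, 0.0, 10.0, 10.0)))
```

Because the random points are spread uniformly over the area, a station's chance of selection depends on the territory nearest to it rather than on how many stations surround it.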

ANALYSIS: SVM TUNING
*Using R ‘svm() {e1071}’ and ‘tune() {e1071}’ functions

ANALYSIS: DECISION TREE TUNING
*Using R ‘rpart() {rpart}’ and ‘tune() {e1071}’ functions
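
The `tune()` function in e1071 searches a grid of parameter values and scores each combination by resampling. The slides do not list the parameters or ranges that were searched, so the Python sketch below shows only the general shape of a cross-validated grid search. The classifier is a trivial threshold rule standing in for `svm()` or `rpart()`, and all names and data are hypothetical.

```python
import itertools


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k interleaved folds of near-equal size."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def grid_search(rows, labels, param_grid, fit, predict, k=10):
    """Return (best_params, best_accuracy) by k-fold cross-validated accuracy."""
    folds = k_fold_indices(len(rows), k)
    names = sorted(param_grid)
    best_params, best_accuracy = None, -1.0
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        correct = 0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(rows)) if i not in held]
            model = fit([rows[i] for i in train], [labels[i] for i in train], **params)
            correct += sum(predict(model, rows[i]) == labels[i] for i in held_out)
        accuracy = correct / len(rows)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy


# Stand-in classifier (not an SVM or a tree): a fixed threshold on one feature.
def fit_threshold(train_rows, train_labels, threshold):
    return threshold


def predict_threshold(model, row):
    return "high" if row[0] > model else "low"


rows = [[1], [2], [3], [4], [6], [7], [8], [9]]
labels = ["low"] * 4 + ["high"] * 4
best = grid_search(rows, labels, {"threshold": [0, 5, 10]},
                   fit_threshold, predict_threshold, k=4)
print(best)  # ({'threshold': 5}, 1.0)
```

The same loop applies to either model: only the `fit` and `predict` callables and the parameter grid change.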

RESULTS: DECISION TREE CONFUSION MATRIX
Rows: predicted ecoregion. Columns: true ecoregion.
Ecoregion        10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8   Total
10.1.3                0       0       0       0       0       0       0       0       0       0       0       0
10.1.5                0       0       0       0       0       0       0       0       1       1       0       0
10.2.1                0       1       8       1       0       1       0       0       0       0       0       0
10.2.2                0       0       0       5       0       0       0       0       0       0       0       0
11.1.1                0       0       0       0      13       0       0       0       1       0       0       2
11.1.2                0       0       1       0       1       8       0       0       0       0       0       0
11.1.3                0       0       0       0       0       0       2       0       0       0       0       0
6.2.11                0       0       0       0       1       0       0       3       0       3       1       0
6.2.12                0       4       1       0       0       0       0       0       5       0       0       0
6.2.7                 0       0       0       0       0       0       0       1       0       1       1       0
6.2.8                 1       0       0       0       0       0       0       0       0       0       2       0
7.1.8                 0       0       0       0       2       0       0       0       0       0       0       0
Accuracy/Class:    0.0%    0.0%   80.0%   83.3%   76.5%   88.9%  100.0%   75.0%   71.4%   20.0%   50.0%    0.0%   65.3%
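
The per-class and total accuracies in the decision tree matrix can be reproduced from its counts. Each per-class figure is the diagonal count divided by the column total, that is, the share of stations truly in that ecoregion that were predicted correctly. The Python sketch below uses the counts from the matrix; the function name is hypothetical.

```python
def per_class_and_total_accuracy(matrix):
    """matrix[i][j] = number of cases predicted as class i whose true class is j."""
    n = len(matrix)
    per_class = []
    for j in range(n):
        true_total = sum(matrix[i][j] for i in range(n))
        per_class.append(matrix[j][j] / true_total if true_total else float("nan"))
    correct = sum(matrix[j][j] for j in range(n))
    total = sum(sum(row) for row in matrix)
    return per_class, correct / total


# Decision tree counts; row and column order: 10.1.3, 10.1.5, 10.2.1, 10.2.2,
# 11.1.1, 11.1.2, 11.1.3, 6.2.11, 6.2.12, 6.2.7, 6.2.8, 7.1.8.
decision_tree = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 8, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 13, 0, 0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 1, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 3, 0, 3, 1, 0],
    [0, 4, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
]
per_class, total = per_class_and_total_accuracy(decision_tree)
print(round(100 * total, 1))  # 65.3
```

The matrix covers 72 test stations, of which 47 lie on the diagonal, giving 47/72 = 65.3%.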

RESULTS: SVM CONFUSION MATRIX
Rows: predicted ecoregion. Columns: true ecoregion.
Ecoregion        10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8   Total
10.1.3                1       0       0       0       0       0       0       0       0       0       1       0
10.1.5                0       2       0       0       0       0       0       0       0       0       0       0
10.2.1                0       1      10       0       0       0       0       0       0       0       0       0
10.2.2                0       0       0       6       0       0       0       0       0       0       0       0
11.1.1                0       0       0       0      13       0       0       0       0       0       0       1
11.1.2                0       0       0       0       0       9       0       0       0       0       0       0
11.1.3                0       0       0       0       1       0       2       0       0       0       0       0
6.2.11                0       0       0       0       1       0       0       4       0       0       1       0
6.2.12                0       1       0       0       1       0       0       0       7       1       0       0
6.2.7                 0       1       0       0       0       0       0       0       0       4       0       0
6.2.8                 0       0       0       0       0       0       0       0       0       0       2       0
7.1.8                 0       0       0       0       1       0       0       0       0       0       0       1
Accuracy/Class:  100.0%   40.0%  100.0%  100.0%   76.5%  100.0%  100.0%  100.0%  100.0%   80.0%   50.0%   50.0%   84.7%

SVM VS DECISION TREE
Ecoregion:       10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8   TOTAL
SVM              100.0%   40.0%  100.0%  100.0%   76.5%  100.0%  100.0%  100.0%  100.0%   80.0%   50.0%   50.0%   84.7%
Decision Tree      0.0%    0.0%   80.0%   83.3%   76.5%   88.9%  100.0%   75.0%   71.4%   20.0%   50.0%    0.0%   65.3%
[Figure: chart of per-ecoregion accuracy, SVM vs. Decision Tree.]

CONCLUSIONS & DISCUSSION
Locations near borders difficult to classify
Decision tree model more intuitive, less accurate
SVM good at separating classes (high dimensionality)
Applications
Assess climate change risk
  - E.g. what areas are likely to change if temperatures increase?
Find similar climates elsewhere
  - E.g. what other areas will support a crop normally grown in California?
