PREDICTING CALIFORNIA’S ECOREGIONS:
A COMPARISON OF SVM AND DECISION TREE APPROACHES
Douglas Callaway
CP5605: Advanced Data Mining and Knowledge Discovery
28 October 2015
MOTIVATION
Explore popular classification algorithms
Gain experience in R programming language
Apply skills to an interesting case study
PURPOSE
Develop 2 models to predict Californian ecoregions
Optimise each model’s parameters
Develop an unbiased sampling technique
Assess & compare each model’s performance using cross-validation
APPROACHES & JUSTIFICATION
Decision Tree
Numeric or nominal predictions
One or more categories
Intuitive rules
Unlimited models possible
Support Vector Machine
Numeric or nominal predictions
One or more categories
Abstract hyperplane class separation
One optimal model per training set
http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_
CASE STUDY: PREDICTING CALIFORNIA’S ECOREGIONS
12 distinct regions
Determined by EPA in 2013
“Denote areas of general similarity in ecosystems and in the type, quality, and quantity of environmental resources.” (U.S. Environmental Protection Agency, 2013)
CALIFORNIA WEATHER DATASET
30 years of weather data (1983-2012)
400 weather stations
Some issues with missing data
PRE-PROCESSING
Data consolidation
Cleaning
Aggregation
Feature selection
474,294 original records
HANDLING MISSING VALUES: SPATIAL INTERPOLATION
Nulls estimated from values of nearby measurements
Closer neighbors given higher weight
Inverse Distance Weighting (IDW)
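The IDW idea above can be sketched in a few lines. This is only an illustration, not the code behind the slides: the power of 2, the planar distance, and the sample stations are all assumptions, since the deck does not state them.

```python
import math

def idw_estimate(target, neighbors, power=2.0):
    """Estimate a missing value at `target` from (x, y, value) neighbors.

    Each neighbor is weighted by 1 / distance**power, so closer
    measurements contribute more to the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in neighbors:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            return value  # a measurement at the exact location is used as-is
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one neighbor is required")
    return weighted_sum / weight_total

# Hypothetical stations as (x, y, temperature); not taken from the dataset.
stations = [(0.0, 1.0, 10.0), (0.0, 2.0, 20.0)]
estimate = idw_estimate((0.0, 0.0), stations)  # nearer station dominates
```

With these two made-up stations the nearer one carries four times the weight of the farther one, so the estimate sits closer to its value.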
FEATURE SELECTION
Provided Attributes
Location (latitude, longitude, elevation)
Monthly temperature
  - Extreme min/max
  - Monthly average
Monthly precipitation
Calculated Attributes
Distance to coast
30-year temperature
  - Extreme min/max
Average annual precipitation
55 attributes in total!
FEATURE SELECTION
CHOOSING APPROPRIATE TEST SET SIZES
[Figure: SVM Accuracy vs. Cross Validation Test Set Size. Axes: Test Set Size (%) and Accuracy (%). Annotated with Overfit and Underfit regions.]
[Figure: Decision Tree Accuracy vs. Cross Validation Test Set Size. Axes: Test Set Size (%) and Accuracy (%). Annotated with Underfit and Overfit regions.]
UNBIASED SAMPLING: CHOOSING DISPERSED TEST LOCATIONS
Weather stations spatially clustered around populated areas
n random locations generated within study area
Weather stations nearest random points chosen for test set
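A minimal sketch of the sampling steps above, under stated assumptions: the study area is approximated by a rectangular bounding box rather than the California boundary, distances are planar, and a station already chosen is skipped so it is not picked twice. The slides do not say how any of these were handled, and the function and station names are hypothetical.

```python
import math
import random

def pick_dispersed_test_set(stations, n, bounds, seed=0):
    """Return ids of the stations nearest to n random points.

    stations: dict mapping station id -> (x, y)
    bounds:   (min_x, max_x, min_y, max_y), a rectangular stand-in
              for the study area (assumption)
    """
    rng = random.Random(seed)
    remaining = dict(stations)
    chosen = []
    for _ in range(min(n, len(remaining))):
        # Generate one random location inside the bounding box.
        px = rng.uniform(bounds[0], bounds[1])
        py = rng.uniform(bounds[2], bounds[3])
        # Take the closest station that has not been chosen yet.
        nearest = min(
            remaining,
            key=lambda sid: math.hypot(remaining[sid][0] - px,
                                       remaining[sid][1] - py),
        )
        chosen.append(nearest)
        del remaining[nearest]  # each station enters the test set once
    return chosen

# Hypothetical station coordinates: three clustered, two isolated.
example_stations = {
    "a": (1.0, 1.0), "b": (1.1, 1.0), "c": (1.0, 1.1),
    "d": (8.0, 2.0), "e": (5.0, 9.0),
}
test_ids = pick_dispersed_test_set(example_stations, 3, (0, 10, 0, 10))
```

Because the random points are spread over the whole area, isolated stations are more likely to be selected than they would be under a uniform draw from the station list, which offsets the clustering around populated areas.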
ANALYSIS: SVM TUNING
*Using R ‘svm() {e1071}’ and ‘tune() {e1071}’ functions
ANALYSIS: DECISION TREE TUNING
*Using R ‘rpart() {rpart}’ and ‘tune() {e1071}’ functions
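Both tuning slides rely on tune(), which searches a grid of parameter values and scores each combination by cross-validation. The sketch below shows that general idea only. It is written in Python rather than R, it does not reproduce the e1071 API, and the k-nearest-neighbour scorer and toy data are stand-ins chosen for brevity (the slides tune an SVM and a decision tree, and their parameter grids are not given).

```python
from itertools import product

def fold_indices(n_samples, folds):
    """Split indices 0..n_samples-1 into `folds` interleaved test sets."""
    return [list(range(start, n_samples, folds)) for start in range(folds)]

def grid_search(score_error, grid, n_samples, folds=10):
    """Return (best_params, best_mean_error) over every grid combination.

    score_error(params, train_idx, test_idx) must return an error rate.
    """
    if n_samples < 1:
        raise ValueError("need at least one sample")
    folds = min(folds, n_samples)
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        errors = []
        for test_idx in fold_indices(n_samples, folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            errors.append(score_error(params, train_idx, test_idx))
        mean_error = sum(errors) / len(errors)
        if mean_error < best_error:  # ties keep the earlier combination
            best_params, best_error = params, mean_error
    return best_params, best_error

# Toy one-dimensional data with two classes (hypothetical).
X = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
Y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def knn_error(params, train_idx, test_idx):
    """Stand-in model: k-nearest-neighbour majority vote (odd k only)."""
    wrong = 0
    for i in test_idx:
        nearest = sorted(train_idx, key=lambda j: abs(X[j] - X[i]))[: params["k"]]
        votes_for_one = sum(Y[j] for j in nearest)
        prediction = 1 if 2 * votes_for_one > len(nearest) else 0
        wrong += prediction != Y[i]
    return wrong / len(test_idx)

best_params, best_error = grid_search(knn_error, {"k": [1, 3, 5]}, len(X), folds=4)
```

Swapping knn_error for a function that fits an SVM or a decision tree on train_idx and scores it on test_idx would give the same search over parameters such as cost, gamma, or tree complexity.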
RESULTS: DECISION TREE CONFUSION MATRIX
Rows: predicted ecoregion. Columns: true ecoregion.

Predicted \ True  10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8
10.1.3                 0       0       0       0       0       0       0       0       0       0       0       0
10.1.5                 0       0       0       0       0       0       0       0       1       1       0       0
10.2.1                 0       1       8       1       0       1       0       0       0       0       0       0
10.2.2                 0       0       0       5       0       0       0       0       0       0       0       0
11.1.1                 0       0       0       0      13       0       0       0       1       0       0       2
11.1.2                 0       0       1       0       1       8       0       0       0       0       0       0
11.1.3                 0       0       0       0       0       0       2       0       0       0       0       0
6.2.11                 0       0       0       0       1       0       0       3       0       3       1       0
6.2.12                 0       4       1       0       0       0       0       0       5       0       0       0
6.2.7                  0       0       0       0       0       0       0       1       0       1       1       0
6.2.8                  1       0       0       0       0       0       0       0       0       0       2       0
7.1.8                  0       0       0       0       2       0       0       0       0       0       0       0
Accuracy/Class:     0.0%    0.0%   80.0%   83.3%   76.5%   88.9%  100.0%   75.0%   71.4%   20.0%   50.0%    0.0%

Total Accuracy: 65.3%
RESULTS: SVM CONFUSION MATRIX
Rows: predicted ecoregion. Columns: true ecoregion.

Predicted \ True  10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8
10.1.3                 1       0       0       0       0       0       0       0       0       0       1       0
10.1.5                 0       2       0       0       0       0       0       0       0       0       0       0
10.2.1                 0       1      10       0       0       0       0       0       0       0       0       0
10.2.2                 0       0       0       6       0       0       0       0       0       0       0       0
11.1.1                 0       0       0       0      13       0       0       0       0       0       0       1
11.1.2                 0       0       0       0       0       9       0       0       0       0       0       0
11.1.3                 0       0       0       0       1       0       2       0       0       0       0       0
6.2.11                 0       0       0       0       1       0       0       4       0       0       1       0
6.2.12                 0       1       0       0       1       0       0       0       7       1       0       0
6.2.7                  0       1       0       0       0       0       0       0       0       4       0       0
6.2.8                  0       0       0       0       0       0       0       0       0       0       2       0
7.1.8                  0       0       0       0       1       0       0       0       0       0       0       1
Accuracy/Class:   100.0%   40.0%  100.0%  100.0%   76.5%  100.0%  100.0%  100.0%  100.0%   80.0%   50.0%   50.0%

Total Accuracy: 84.7%
SVM VS DECISION TREE
Per-class accuracy by true ecoregion:

Ecoregion:        10.1.3  10.1.5  10.2.1  10.2.2  11.1.1  11.1.2  11.1.3  6.2.11  6.2.12   6.2.7   6.2.8   7.1.8  TOTAL:
SVM               100.0%   40.0%  100.0%  100.0%   76.5%  100.0%  100.0%  100.0%  100.0%   80.0%   50.0%   50.0%   84.7%
Decision Tree       0.0%    0.0%   80.0%   83.3%   76.5%   88.9%  100.0%   75.0%   71.4%   20.0%   50.0%    0.0%   65.3%

[Figure: per-ecoregion accuracy chart, SVM vs. Decision Tree]
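The per-class and total accuracies reported on the results slides can be recomputed from the two confusion matrices. The sketch below transcribes the counts from those slides (rows are predicted ecoregions, columns are true ecoregions); the function names are illustrative and the code is Python, not the R used for the original analysis.

```python
LABELS = ["10.1.3", "10.1.5", "10.2.1", "10.2.2", "11.1.1", "11.1.2",
          "11.1.3", "6.2.11", "6.2.12", "6.2.7", "6.2.8", "7.1.8"]

# Rows: predicted ecoregion. Columns: true ecoregion.
DECISION_TREE = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 8, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 13, 0, 0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 1, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 3, 0, 3, 1, 0],
    [0, 4, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
]

SVM = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 4, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 7, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
]

def per_class_accuracy(matrix):
    """Percent of each true class (column) that was predicted correctly."""
    size = len(matrix)
    result = []
    for j in range(size):
        column_total = sum(matrix[i][j] for i in range(size))
        result.append(100.0 * matrix[j][j] / column_total
                      if column_total else float("nan"))
    return result

def total_accuracy(matrix):
    """Percent of all test stations on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

tree_total = round(total_accuracy(DECISION_TREE), 1)  # 65.3 (47 of 72)
svm_total = round(total_accuracy(SVM), 1)             # 84.7 (61 of 72)
svm_by_class = dict(zip(LABELS, (round(a, 1) for a in per_class_accuracy(SVM))))
```

Both matrices cover the same 72 test stations, so the two totals are directly comparable. Several classes have five or fewer test stations, so their per-class percentages move in large steps with each misclassification.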
CONCLUSIONS & DISCUSSION
Locations near borders difficult to classify
Decision tree model more intuitive, less accurate
SVM good at separating classes (high dimensionality)
Applications
Assess climate change risk
  - E.g. what areas are likely to change if temperatures increase?
Find similar climates elsewhere
  - E.g. what other areas will support a crop normally grown in California?
