This document summarizes a study that analyzed multivariate models to predict the presence or absence of the West Nile virus. Four classification models - logistic regression, linear discriminant analysis, random forests, and support vector machines - were developed using weather, temporal, and spatial factors. An ensemble model that combined the generalized additive model and support vector machine with weights of 0.6 and 0.4, respectively, achieved the best results with an AUC of 0.8361962. The models took into account the developmental stages of mosquitoes to better predict the transmission pattern of the West Nile virus.
MULTIVARIATE ANALYSIS PREDICTS WEST NILE VIRUS USING WEATHER SPATIAL FACTORS
MULTIVARIATE ANALYSIS
FARZAD ESKANDANIAN, MAX LI, JOYCE ROSE, NASIM SONBOLI
CSC 424 | ADVANCED DATA ANALYSIS
6|14|2015
The purpose of this paper is to discuss the model(s) used in predicting the presence or absence of the West Nile virus [WNV]. The uniqueness of this multivariate analysis is the use of weather, temporal and spatial factors based on the premise of time-based effects. That is, the models built take into account the developmental stages of a mosquito. Four individual classifiers were built: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). The best combination of parameters from each model was included in the ensemble model. Species, week number, location, moving temperature averages, precipitation moving averages and growing degree days played an important role in predicting WNV. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
INTRODUCTION
The West Nile virus (WNV) is “a mosquito borne disease-causing infectious agent” (Theophilides et al, 2006, para. 1) that affects birds, humans, and other animals. WNV was first reported in the United States in 1999. Since that initial occurrence, seasonal epidemics of WNV have been recorded, leading to a series of studies focused on understanding the features and characteristics of the virus.

The research available on WNV indicates that “the infections caused by pathogens by way of a mosquito vector often cluster in space and time given the habitat requirements of the vectors and the vertebrate involved in the transmission.” (Ruiz et al, 2007, para 8). In other words, West Nile viral transmission is attributed to patterns of climate, landscape, hydrology and types of human settlements. Ruiz et al (2010) argue that the statistical models built thus far by researchers are mere reports that only characterize associations between the virus and weather, landscape, human density etc. Though they offer insights about WNV, the associations themselves are not enough to develop and implement preventive measures for future epidemics.
The interesting aspect of the WNV challenge arises from the need to build a better model that takes into account the life cycle of the mosquitoes in relation to the variability in weather and its impact “on growth or activity of an organism.” Such a model can go a step beyond associations and indicate the best time and location for early intervention.

WEST NILE VIRUS | CHICAGO

The importance of building a robust model with predictive capabilities lies in the need to prevent future outbreaks. Therefore, the goal of this project is to build a model that uses weather, temporal and spatial factors to predict the presence of the West Nile virus.
DATA DESCRIPTION
Kaggle’s West Nile Virus challenge consists of the following datasets¹:

        Train    Weather    Spray    Test
Obs     10506    2944       14835    116293
Var     12       22         4        11
The datasets contain a combination of string and numeric variables. “In many cases, some predictors have no values for a given sample. These missing data could be structurally missing” (Kuhn & Johnson, p.41). For instance, station 2 does not collect information on depart, depth, water1, snowfall, sunset and sunrise. These structurally missing values are denoted by “M”, “T”, or “-”. “In other cases, the value cannot or was not determined at the time of the model building” (Kuhn & Johnson, p.41). Examples of variables with such missing values are tavg, wetbulb, heat, cool, preciptotal, stnpressure, sea level, time [584 values] and average speed. Hence, both the spray data and the weather data contain missing values.

The missing values for time in the spray dataset are “concentrated in a subset of predictors” (Kuhn & Johnson, p.41). In other words, the 584 missing values in the spray data all relate to 09/07/2011, where time was not recorded after 7:44:32 PM and before 7:46:30 PM. The non-structurally missing values in the weather dataset, however, appear to occur randomly across all the predictors. The counts of missing values for each of the predictor variables have been tabulated below.

¹ The fields for the datasets can be found in Table 3 in the appendix titled “Data Fields”.
The response variable has two classes that the model aims to predict, namely the presence or absence of the West Nile virus [1, 0]. The explanatory variables are: maximum temperature, minimum temperature, average temperature, precipitation, result wind speed, result wind direction, species, trap, longitude, latitude, number of mosquitoes and address.
EXTERNAL DATASETS
Although Kaggle already provides a number of explanatory variables for the West Nile Virus challenge, there are ample opportunities to include external datasets containing other variables that may improve a predictive model’s performance. For example, Ruiz et al (2010) found that the amount of vegetation and the degree to which water would flow or remain in an area mediated the effect of weather in predicting the infection rate of West Nile virus. Socioeconomic factors that measured poverty also seemed to correlate with the presence of West Nile virus. Bringing in additional data from reliable government sources that reflect the aforementioned factors will help us fine-tune our predictive models.
MULTIVARIATE ANALYSIS
The main objective of a multivariate analysis is to use multiple data mining techniques to study how variables relate to one another. This method of analysis is most often used when the dataset contains more than one explanatory variable, more than one response variable, or both. Kaggle’s West Nile Virus dataset contains one response variable and 12 explanatory variables. Using a multivariate analysis for such a dataset is desirable because the final outcome, accurately predicting the presence or absence of WNV, might be influenced by more than one attribute. For instance, principal component analysis can be used to “decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables” (Abdi, p.1). Performing PCA will determine the dominant trends in the dataset, to which, for example, a logistic regression model can then be applied. Conducting a logistic regression alone with 12 explanatory variables may not produce a stable model if there is a strong dependence between predictors. PCA addresses this multicollinearity because the components it produces are uncorrelated, which makes the regression estimates more stable. Therefore, weighing the advantages and disadvantages of using one technique in conjunction with another, in light of the number of explanatory variables, is what motivates the use of multivariate analysis.
DATA COLLECTION
The dataset provided by the Chicago Department of Public Health and NOAA [National Oceanic and Atmospheric Administration] comprises weather data², GIS data³, the dates on which traps were set [spanning 3 days each week for approximately 5 months], the location of traps and the species for the years between 2007 and 2014. The main dataset is split into a training dataset and a testing dataset. The training dataset reflects data points collected for the odd years: 2007, 2009, 2011 and 2013, whereas the testing dataset consists of data points gathered for the even years: 2008, 2010, 2012 and 2014.
There are two central factors that serve as the premise for when and why the WNV data was collected. The first factor is weather. “It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet.” (Kaggle, information description, para. 9) Therefore, the dataset captures information about weather [from station 1 – Chicago O’Hare International Airport – and station 2 – Chicago Midway International Airport] only for the months of late May through early October.

The second factor is the availability of data for the number of mosquitos trapped, the location, the species identified and the test results for the presence or absence of the West Nile virus. “Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week.” (Kaggle, information description, para. 3) It is no coincidence that traps are only set out in late spring through early fall, when the weather is conducive to population growth in mosquitos. Identifying the location of the traps, the number of mosquitos trapped, the species, and the frequencies of each species infected or not infected with the virus, in conjunction with weather, is crucial in understanding where the next sporadic growth of the mosquitos will occur. After all, the goal of the predictive model is to identify the presence or absence of WNV by predicting the occurrence and the rate of mosquito growth in one particular location over another given a set of weather conditions. Such predictions can be used by the City of Chicago and CDPH “to efficiently and effectively allocate resources” to control the population growth of mosquitos, which in turn prevents the transmission of the “potentially deadly virus.”

² Weather data has been collected only for dates on which the traps were set.
³ GIS data for spraying is only available from 2011 to 2013.
DATA MERGING
The West Nile training dataset does not contain the weather variables required for a robust analysis. Therefore, the weather dataset has been merged with the train file, resulting in a merged file titled “wnv.train.weather.” The keys used to merge the two files are date and station. Since the NOAA weather dataset provides weather data from two weather stations located in the Greater Chicago Area, the distance from the site of each trap to each of the two weather stations was calculated and used to select, for each training record, the weather information of the nearer station. Two distance metrics were considered: 1) the Euclidean distance formula,

$$D = \sqrt{(lat_{station} - lat_{trap})^2 + (long_{station} - long_{trap})^2}$$

as well as 2) the Haversine formula (http://en.wikipedia.org/wiki/Haversine_formula), which takes into account the curvature of the Earth. The “geosphere” R package was used to calculate the Haversine distance.
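The station selection described above can be sketched in Python as follows. This is a minimal illustration, not the authors' R code: the airport coordinates, the Earth radius constant and the function names are assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

# Approximate airport coordinates, assumed for illustration only.
STATIONS = {
    1: (41.995, -87.933),  # Chicago O'Hare International Airport
    2: (41.786, -87.752),  # Chicago Midway International Airport
}


def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance in degree units (ignores the Earth's curvature)."""
    return math.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(trap_lat, trap_lon, distance=haversine_km):
    """Return the id of the station closest to a trap under the given metric."""
    return min(STATIONS,
               key=lambda s: distance(trap_lat, trap_lon, *STATIONS[s]))
```

Over an area as small as Chicago the two metrics will usually pick the same station; the Haversine version mainly matters when the distance itself is used as a feature, since it is expressed in kilometres rather than degrees.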
NEW FEATURES
Ruiz et al. (2010) reported the importance of temporal characteristics of weather in predicting infection rates of WNV in Northern Illinois. For example, they found a positive correlation at 1 to 3 week lags between precipitation and infection rates. Based on this research, new features were created to capture this information in the weather dataset, namely a 2-week moving average of precipitation as well as a 2-week moving sum of accumulated rainfall.
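A trailing-window feature of this kind might be computed as in the sketch below. It assumes one precipitation value per day in date order and a 14-day window; the function name and the choice to use a shorter window for the first days of the series are assumptions, not details given in the paper.

```python
from collections import deque


def trailing_window(values, window=14):
    """Return (moving_sums, moving_averages) over a trailing window of days.

    For the first days of the series, where fewer than `window` values
    exist, the statistics are taken over the values available so far.
    """
    sums, means = [], []
    recent = deque(maxlen=window)  # automatically drops the oldest value
    for value in values:
        recent.append(value)
        total = sum(recent)
        sums.append(total)
        means.append(total / len(recent))
    return sums, means
```

For example, `trailing_window([1, 2, 3, 4], window=2)` gives the sums `[1, 3, 5, 7]` and the averages `[1.0, 1.5, 2.5, 3.5]`.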
Also, time-based effects of temperature were explored. This entailed the use of a metric known as growing degree days (GDD), which measures heat accumulation and is used to predict mosquito development rates. GDD was calculated as

$$GDD = \begin{cases} T_{mean} - T_{base}, & \text{if } T_{mean} > T_{base} \\ 0, & \text{if } T_{mean} \le T_{base} \end{cases}$$

where $T_{base}$ represents a threshold temperature at which an organism’s growth rate is near zero. From reviewing the literature, $T_{base}$ can range between 13°C and 33°C. We will vary $T_{base}$ and observe the threshold value that yields the best performing model.
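The GDD definition translates directly into code. The sketch below is illustrative: the function names are assumptions, and the Fahrenheit conversion is included only because the appendix reports temperatures that appear to be in °F while the threshold is quoted in °C, so the two must be brought to the same unit first.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def growing_degree_days(t_mean, t_base):
    """Daily GDD: the excess of the mean temperature over the base, or zero."""
    return max(t_mean - t_base, 0.0)


def accumulated_gdd(daily_means, t_base):
    """Running total of daily GDD over a season (same unit as the inputs)."""
    total, running = 0.0, []
    for t_mean in daily_means:
        total += growing_degree_days(t_mean, t_base)
        running.append(total)
    return running
```

Varying the threshold then amounts to recomputing `accumulated_gdd` for each candidate `t_base` and refitting the model.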
Other features that were created from the base training data include the specific week number of a year. The abundance of mosquitos, and consequently the presence of WNV, is expected to be more prevalent during certain times of the year. Therefore it is surmised that the week number will be important in predicting the timing of WNV.
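Deriving a week number from a collection date is a one-line operation. The sketch below uses the ISO 8601 week convention from Python's standard library, which is an assumption; the paper does not state which week-numbering convention was used.

```python
from datetime import date


def week_of_year(collection_date):
    """ISO 8601 week number (1-53) for a trap collection date."""
    return collection_date.isocalendar()[1]


# A date in late May falls in ISO week 22 of 2007.
example_week = week_of_year(date(2007, 5, 29))
```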
CATEGORICAL VARIABLES
Dealing with categorical variables can pose certain limitations. For example, if a variable in a given data set contains several categories, there arises a need to re-categorize the classes into smaller groups for the sake of simplicity and the robustness of the predictive model. In addition, depending on the data mining technique used, numerical rather than categorical data may be required. The categorical variables found in the WNV dataset have undergone transformations in the form of re-categorization. For instance, the variable species is categorical with seven classes as indicated in the table below:

Table 1: Species
However, Table 1 indicates that only 3 species have tested positive for WNV. Re-categorization highlights the importance of the three classes associated with WNV, leaving the other four classes to be grouped in a category of their own, indicative of their lack of contribution to the spread of WNV.⁴ It is also important to note that the training set has a class titled “uncategorized.” By creating the fourth category called “Culex Other”, the issue of the unidentified species is addressed effectively. The re-categorization approach has been applied to the variable date as well.

⁴ Table 2 titled Species 2 contains the new groupings.
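The species re-categorization can be sketched as a simple lookup. The paper does not list the species names in the text, so the three names below are an assumption based on the species labels in the public Kaggle data; the function and constant names are likewise illustrative.

```python
# Species assumed to be the three WNV-positive classes (not named in the text).
WNV_POSITIVE_SPECIES = {
    "CULEX PIPIENS",
    "CULEX PIPIENS/RESTUANS",
    "CULEX RESTUANS",
}


def recategorize_species(species):
    """Keep the three WNV-associated classes and pool the rest as 'CULEX OTHER'."""
    name = species.strip().upper()
    return name if name in WNV_POSITIVE_SPECIES else "CULEX OTHER"
```

Because any label outside the three kept classes falls into the pooled category, an unidentified species in new data is handled without special casing.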
EXPLORATORY DATA ANALYSIS
One of the prime focuses of an exploratory data analysis is to check whether the specific characteristic(s) of a data set meet the requirements of the modeling technique(s) to be used, as some models may be sensitive to certain types of data. That is, how is the data set distributed? Skewness of a distribution, whether positive or negative, is often a result of a “subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution.” (Sim et al, 2005, pg.642). Histograms and box plots are graphical tools widely used to inspect the data for the presence of outliers. There are two important questions to address after visually inspecting the boxplot: first, is it possible for the boxplot to incorrectly declare certain points as outliers? Second, does the presence of outliers imply the need for a transformation?
The box plots⁵ for the West Nile dataset identify certain variables as skewed, with outliers present. For instance, the distribution of the number of mosquitos is right skewed, being pulled to the right by the largest values in that column. The IQR⁶ rule for outliers indicates that values lying below -20 and above 39.5 are potential outliers.

⁵ All histograms and box plots with short description of shape, center and spread for the WNV data set can be found in the appendix.
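The IQR rule flags values more than 1.5 interquartile ranges outside the quartiles. The sketch below is illustrative: the function names are assumptions, and the quartiles 2 and 17 used in the comment are hypothetical values chosen because they reproduce the upper fence of 39.5 quoted above, not figures taken from the paper.

```python
import statistics


def quartiles(values):
    """First and third quartiles (inclusive method; other methods differ slightly)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences of the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3):
    """Values lying outside the fences, i.e. potential outliers."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# With hypothetical quartiles of 2 and 17 the fences are (-20.5, 39.5).
example_fences = iqr_fences(2, 17)
```

Since a mosquito count cannot be negative, only the upper fence is informative for that variable.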
On examining the number of mosquitos trapped for each species, it is apparent that class imbalance plays an important role in the skewness of the data, as shown in Table 2.

Table 2: Number of Mosquitos Trapped

All numbers above 39.5 represent the species attributed to WNV and the locations where they abound. There exists a pattern between the type of species, the location and the number of mosquitos trapped that is beyond the scope of the boxplot.
Similarly, the boxplots for most of the weather variables in the WNV dataset show the presence of outliers. However, weather varies continually on yearly, monthly, weekly and daily scales, and the differences in data points for stations 1 and 2 can be due to the geographical locations of the stations and/or the way in which the instruments record the temperatures. The Natural Resources Management and Environment Department furthers this argument by stating that “weather data collected at a given weather station during a period of several years may be not homogeneous, i.e., the data set representing a particular weather variable may present a sudden change [from one weather station to another]. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others, which relate to modification of the environmental conditions of the site” or even “change in the time of the observations.” (para.14)

⁶ The appendix contains a table titled “Lower and Upper Bound Outliers”.
Thus, the skewness of the distribution is not necessarily a consequence of extreme data points; rather, it is a result of class imbalance. For instance, the histogram for the accumulated degree day shows that the distribution is skewed to the right. But when the histogram is constructed taking into consideration the presence or absence of WNV, it becomes clear that class imbalance is the root of the skewness, as seen in the histograms below:

The histograms show that there are no wnvpresent cases at lower/higher degree days. However, the histograms for acc.deg.day when wnvpresent = 0, or 1 and 0 together, appear flatter. In order to remove the skewness, the data points were replaced by their square roots, resulting in data that is better behaved than in its original units.
In addition to skewness, another factor that affects the predictive capability of a model is the presence of outliers. As noted earlier, the weather data contains outliers. It also contains missing values. “For a large dataset, removal of samples based on missing values is not a problem, assuming the missingness is not informative” (Kuhn & Johnson, 2013, p.41). However, a more robust way of handling missing information is by imputation. “Imputation is [a] layer of modelling where missing values are estimated based on other predictor variables. This amounts to a predictive model within a predictive model” (Kuhn & Johnson, 2013, p.42). Missing values in the weather data set have been addressed by the implementation of hot deck imputation, where each missing value is replaced with an observed value from a similar unit. “An attractive feature of the hot deck imputation is that only plausible values can be imputed since values come from observed responses in the donor pool” (Andridge & Little, 2011, para. 3), which means that the imputed weather values are more likely to resemble the other data points than imputed averages would. The second advantage of using hot deck imputation is that the “method does not rely on model fitting for the variable to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric method such as regression imputation” (Andridge & Little, 2011, para. 3).
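One simple form of hot deck imputation picks, for each incomplete record, the complete record that is closest on a set of matching variables and copies its value. The sketch below illustrates that nearest-neighbour variant; the paper does not say how donors were chosen, so the matching rule, the field names and the function name are all assumptions.

```python
def hot_deck_impute(records, target, match_on):
    """Fill missing `target` values (None) from the most similar donor record.

    Similarity is the sum of absolute differences over the `match_on` fields,
    which must be observed in every record. The input list is not modified.
    """
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no donor has an observed value for " + target)

    filled = []
    for record in records:
        if record[target] is None:
            donor = min(
                donors,
                key=lambda d: sum(abs(d[k] - record[k]) for k in match_on),
            )
            record = {**record, target: donor[target]}  # copy, then fill
        filled.append(record)
    return filled
```

Because the imputed value is always one that was actually observed, the result stays within the plausible range of the variable, which is the property the quotation above emphasizes.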
CORRELATION ANALYSIS
There are specific variables in the dataset that reveal interesting patterns, such as the number of mosquitos, temperature and precipitation. The goal of the correlation analysis was to plot or capture a trend that would explain the relationship between these variables and the presence of the West Nile virus. Since the variables are on different scales, they were normalized using the Z-score formula. In addition to normalizing the data, average values of the said variables were used in building the plots.
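Z-score normalization subtracts the mean and divides by the standard deviation, so that variables measured in different units can share one axis. A minimal sketch follows; the use of the sample standard deviation is an assumption, as the paper does not specify which form was used.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values]
```

For example, `z_scores([1, 2, 3])` returns `[-1.0, 0.0, 1.0]`.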
The plots pertain to weekly records captured for 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. Individual plots have been drawn for each year. The blue line shows the average precipitation, the red line shows the average number of mosquitos, the green line shows the average temperature and the purple line shows the presence of the virus.

Figure 1: 2007
According to the line graph for the year 2007, a sudden decrease in temperature is followed by a decrease in the number of mosquitos after week 35. Consequently, the average number of detected virus cases decreases. It was also noted that the higher the temperature and the precipitation, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus. An interesting pattern was found between precipitation and the increase in the number of mosquitos: the number of mosquitos rises rapidly not during the week of high precipitation but in the week after. It appears that the virus infects the mosquitos once their numbers have increased. The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, suggesting that a large share of the mosquitos carry the virus even though the mosquito population is small. Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots have captured similar trends.

Figure 2: 2009
Figure 3: 2011
Figure 4: 2013
The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model is likely to rely on these features more than the others to predict WNV. Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity, an issue to consider in the modeling process.
MODELS
Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm. It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV. Therefore, the modeling process for this data set will be broken into two parts. Part I will focus on determining how to best incorporate the available features into a classification model. Part II will focus on investigating and fine tuning the specific classification algorithms to yield the best possible prediction.
Part I
Weather Data and Principal Component Analysis
Due to the number of weather attributes available to the researcher in the dataset, it becomes quite difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features will be correlated with one another, resulting in multicollinearity. For example, the amount of precipitation will be correlated with atmospheric pressure and, in turn, with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences of the original weather data while eliminating the detrimental effects that can result from the linear dependency of predictor variables. Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first 5 components from PCA will be used to reflect the weather conditions of a specific day in the data.
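The PCA step can be sketched with NumPy as an eigendecomposition of the correlation matrix of the standardized weather columns. This is a generic illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, every column is assumed to be numeric with non-zero variance, and standardizing before PCA is an assumption.

```python
import numpy as np


def pca(X, n_components):
    """PCA on standardized columns.

    Returns (scores, loadings, explained): the component scores for each row,
    the loading vectors, and the fraction of variance explained by every
    component in descending order.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    corr = np.cov(Z, rowvar=False)                    # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    loadings = eigvecs[:, :n_components]
    scores = Z @ loadings
    return scores, loadings, explained
```

The number of components to keep can then be read off `np.cumsum(explained)`, for example by taking the smallest count whose cumulative share passes a chosen threshold.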
Figure 5: PCA
Figure 6: Clustering
Figure 7: Model Summary
Temporally based weather variables and week number
While the weather conditions of a specific day can affect the activity level of mosquitos for that day, they do not take into account a mosquito’s life-cycle or the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree day, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) will be included in the model. Also, week numbers of the year will be incorporated to capture the intra-annual (within-year) timing of mosquito populations.
Clustering Location Data
Determining a good way to represent location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, these are believed not to be in a form conducive to predictive modeling, due to the non-linear nature of spatial data. Thus the k-means algorithm (k = 20) was used to translate the location data represented by longitude/latitude pairs into clustered locations. Figure 6 shows the location of the clusters using a normalized scale. As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations will be used as a categorical variable in our models.
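The clustering step can be illustrated with a small pure-Python version of Lloyd's k-means algorithm. It is a sketch only: the random initialization from the data points, the fixed seed and the function name are assumptions, and in practice a library implementation would normally be used.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centers) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers
```

Calling `kmeans(trap_coordinates, k=20)` on normalized (longitude, latitude) pairs would give each trap a cluster label that can serve as the categorical location variable; `trap_coordinates` is a hypothetical name for that list.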
Part II
With the necessary data pre-processing and variable transformations completed, the focus moved on to the construction of models to predict WNV. The overall approach was to build an ensemble, a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers from which it is built. The strategy was to consider four individual algorithms and build the best possible classifier out of each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle’s train dataset was randomly split 70%/30%: the 70% was used as the training set and the remaining 30% served as the hold-out test set. Figure 7 is a summary of the best set-ups for each algorithm. Of all the individual models, GAM was clearly the best performing, with an AUC value of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
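The weighted-average ensemble and its evaluation can be sketched as below. The code assumes each model outputs a probability of WNV presence per hold-out record; the function names are illustrative, and the AUC is computed with the standard rank-sum (Mann-Whitney) formulation rather than whatever routine the authors used.

```python
def weighted_ensemble(p_gam, p_svm, w_gam=0.6, w_svm=0.4):
    """Weighted average of two models' predicted probabilities."""
    return [w_gam * a + w_svm * b for a, b in zip(p_gam, p_svm)]


def auc(labels, scores):
    """Area under the ROC curve via rank sums; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Candidate weights can then be compared by calling `auc(y_holdout, weighted_ensemble(p_gam, p_svm, w, 1 - w))` over a grid of `w` values, where `y_holdout`, `p_gam` and `p_svm` are hypothetical names for the hold-out labels and the two models' predictions.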
CONCLUSION
Although the ensemble model had the highest AUC value achieved in the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard. In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were 1) an ensemble of GAM logistic regression and GLM logistic regression and 2) a slightly modified Poisson GLM model. Neither had a notable training AUC, but both performed well on Kaggle.

Other validation techniques were investigated in an attempt to obtain better feedback from the training process and thereby build a better model. Instead of using a 70/30 training and testing split, a modified version of n-fold cross validation was used where one year’s data was left out as testing and the remaining years were used as training. This process was repeated four times, once for each year, and the model’s performance was averaged across the four runs. The best models achieved from this validation technique did not seem any different from the models built on a traditional 70/30 split.
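The leave-one-year-out scheme described above can be sketched generically. The `fit` and `score` callables and the `"year"` field are hypothetical placeholders for whatever model-fitting routine, metric and record layout are in use.

```python
def leave_one_year_out(records, year_of=lambda r: r["year"]):
    """Yield (held_out_year, train, test) once for every year in the data."""
    years = sorted({year_of(r) for r in records})
    for held_out in years:
        train = [r for r in records if year_of(r) != held_out]
        test = [r for r in records if year_of(r) == held_out]
        yield held_out, train, test


def cross_validated_score(records, fit, score):
    """Average of score(model, test) over all leave-one-year-out folds."""
    fold_scores = [
        score(fit(train), test)
        for _, train, test in leave_one_year_out(records)
    ]
    return sum(fold_scores) / len(fold_scores)
```

With training data from 2007, 2009, 2011 and 2013 this produces four folds, each testing on a year the model has not seen, which mimics the way the Kaggle test set is drawn from different years than the training set.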
Figure 8: Models & Imbalance
Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see if the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and its relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance.

If using the appropriate validation technique does not account for the disparity between the training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and the testing data. Specifically, it is possible that there are idiosyncratic intra-annual variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al (2014) report that populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year. It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data. This intuitively makes sense, as most of the models that performed better on Kaggle tend to be simple models that included variables like location, week number and mosquito species, which generalize across all years of the data.
Another matter to consider for future model building is the importance of the spray data. Though the spray data is not a part of the testing dataset, which would seem to warrant its immediate dismissal from the predictor selection process, the following heat map implies otherwise. Upon close inspection of the heat map, one speculates that spraying in one year does indeed alter the mosquito population in the next year, which might explain why mosquito populations appear in different locations each year. Also, feature engineering of the predictor variable depart [departure from normal] might help in creating a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from the normal temperature as hotter than normal or colder than normal.
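The suggested categorization of depart could look like the following sketch. The tolerance band and the extra "normal" label for days within it are assumptions added for illustration; the text proposes only the hotter and colder categories.

```python
def depart_category(depart, tolerance=0.0):
    """Label a day's departure from the normal temperature.

    `depart` is the signed departure from normal; values within
    +/- `tolerance` are labeled "normal".
    """
    if depart > tolerance:
        return "hotter than normal"
    if depart < -tolerance:
        return "colder than normal"
    return "normal"
```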
Appendix
Table 3: Data Fields

Number  Train              Weather                 Spray      Test
1       Date               Station                 Date       ID
2       Address            Date                    Time       Date
3       Species            Max Temperature         Latitude   Address
4       Block              Min Temperature         Longitude  Species
5       Street             Avg Temperature                    Block
6       Trap               Departure from Normal              Street
7       Address Number     Dew Point                          Trap
8       Latitude           Wet Bulb                           Address Number
9       Longitude          Heat                               Latitude
10      Address Accuracy   Cool                               Longitude
11      # of Mosquitoes    Sunrise                            Address Accuracy
12      Wnvpresent         Sunset
13                         Code Sum
14                         Depth
15                         Water1
16                         Snowfall
17                         Total Precipitation
18                         Station Pressure
19                         Sea Level
20                         Wind Speed
21                         Wind Direction
22                         Average Speed
SKEWNESS OF VARIABLES & OUTLIERS

DATE PATTERN
The data is skewed to the left. There are more records for 2007 than other years but not by a significant amount. If this becomes problematic, we may sample an equal number of records for each year.

LATITUDE PATTERN
Shape: Latitude is very slightly skewed to the left. Mean is less than the median
Center: 41.84628
Spread: 41.64461 to 42.01743

LONGITUDE PATTERN
Shape: Longitude is symmetric
Center: -87.69499
Spread: -87.93099 to -87.53163

NUMBER OF MOSQUITOS PATTERN
Shape: The distribution is right skewed as the mean, 12.85351, is pulled to the right away from the median, which is 5
Center: 5
Spread: 1 to 50
Outlier: The boxplot confirms the skewness of the histogram in that there are large numbers causing the distribution to be pulled to the right. The outlier function indicates the largest number in the data for number of mosquitos is 50
DISTANCE FROM O’HARE PATTERN
Shape: The distribution is symmetric
Center: 0.2943334
Spread: 0.0372549 to 0.5179756

DISTANCE FROM MIDWAY PATTERN
Shape: The distribution is slightly skewed to the left as the mean 0.1548598 is pulled away from the median 0.1616137
Center: 0.1616137
Spread: 0.0077139 to 0.2481943

MAXIMUM TEMPERATURE PATTERN
Shape: The distribution is skewed to the left as the mean 81.94765 is pulled away to the left from the median 83
Center: 83
Spread: 57 to 97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 57 is the point that is distant from the other values in the dataset.

MINIMUM TEMPERATURE PATTERN
Shape: The distribution is skewed to the left as the mean 64.16533 is pulled away to the left from the median 66
Center: 66
Spread: 41 to 79
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 41 is the point that is distant from the other values in the dataset.
AVERAGE TEMPERATURE PATTERN
Shape: The distribution is skewed to the left as the mean 38.28412 is pulled away to the left from the median 40.
Center: 40
Spread: 15 to 52
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 15 is the point that is distant from the other values in the dataset.
TOTAL PRECIPITATION PATTERN
Shape: The distribution is skewed to the right as the mean 0.1274281 is pulled away to the right from the median 0.
Center: 0
Spread: 0.00 to 3.97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 3.97 is the point that is distant from the other values in the dataset.
RESULT OF WIND SPEED PATTERN
Shape: The distribution is skewed to the right as the mean 5.911003 is pulled away to the right from the median 5.5.
Center: 5.5
Spread: 0.1 to 15.4
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 15.4 is the point that is distant from the other values in the dataset.
RESULT OF WIND DIRECTION PATTERN
Shape: The distribution is skewed to the left as the mean 17.72016 is pulled away to the left from the median 19.
Center: 19
Spread: 1 to 36
AVERAGE WIND SPEED PATTERN
Shape: The distribution is skewed to the left as the mean 123.4147 is pulled away to the left from the median 139.
Center: 139
Spread: 3 to 177
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 3 is the point that is distant from the other values in the dataset.
TEMPERATURE MOVING AVERAGES – 1 WEEK PATTERN
Shape: The distribution is skewed to the left as the mean 72.5431 is pulled away to the left from the median 73.14286.
Center: 73.14286
Spread: 53.14286 to 83.85714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 53.14286 is the point that is distant from the other values in the dataset.
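The moving averages and moving sums summarized in this part of the appendix can be computed with one windowing routine. The sketch below assumes a trailing window (the current day plus the days before it) and allows shorter windows at the start of the series. The source does not say whether the windows were trailing or centered, and the function names and toy series are hypothetical.

```python
from collections import deque


def trailing_window(values, window, reducer):
    """Apply `reducer` to the last `window` values seen so far, one result per day."""
    buf = deque(maxlen=window)  # the oldest value drops out automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(reducer(buf))
    return out


def moving_average(values, window):
    return trailing_window(values, window, lambda b: sum(b) / len(b))


def moving_sum(values, window):
    return trailing_window(values, window, sum)


# Toy daily series (hypothetical): average temperature and precipitation.
tavg = [70, 72, 75, 71, 68, 74, 77, 80, 79]
precip = [0.0, 0.2, 0.0, 0.0, 1.1, 0.0, 0.0, 0.3, 0.0]

ma7 = moving_average(tavg, 7)   # 1-week moving average of temperature
ms7 = moving_sum(precip, 7)     # 1-week moving sum of precipitation
print(round(ma7[6], 5), round(ms7[8], 2))
```

Passing `window=14` gives the 2-week versions. A trailing window fits the study's premise of time-based effects, since it uses only weather that precedes each trap observation.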
TEMPERATURE MOVING AVERAGES – 2 WEEK PATTERN
Shape: The distribution is skewed to the left as the mean 72.41439 is pulled away to the left from the median 73.
Center: 73
Spread: 55.07143 to 82.76923
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 55.07143 is the point that is distant from the other values in the dataset.
MOVING AVGS OF PRECIPITATION – 1 WEEK PATTERN
Shape: The distribution is skewed to the right as the mean 0.1333564 is pulled away to the right from the median 0.07.
Center: 0.07
Spread: 0.0000 to 1.42857
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 1.42857 is the point that is distant from the other values in the dataset.
MOVING AVGS OF PRECIPITATION – 2 WEEK PATTERN
Shape: The distribution is skewed to the right as the mean 0.130 is pulled away to the right from the median 0.085.
Center: 0.085
Spread: 0.0007 to 0.76714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 0.76714 is the point that is distant from the other values in the dataset.
MOVING SUM OF PRECIPITATION – 1 WEEK PATTERN
Shape: The distribution is skewed to the right as the mean 0.9432334 is pulled away to the right from the median 0.53.
Center: 0.53
Spread: 0.000 to 9.149
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 9.15 is the point that is distant from the other values in the dataset.
MOVING SUM OF PRECIPITATION – 2 WEEK PATTERN
Shape: The distribution is skewed to the right as the mean 1.74216 is pulled away to the right from the median 1.1.
Center: 1.1
Spread: 0.000 to 10.74999
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 10.75 is the point that is distant from the other values in the dataset.
DEGREE DAY PATTERN
Shape: The distribution is skewed to the right as the mean 3.824472 is pulled away to the right from the median 3.4.
Center: 3.4
Spread: 0.0 to 14.9
ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN
Shape: The distribution is skewed to the right as the mean 241.0934 is pulled away to the right from the median 239.6.
Center: 239.6
Spread: 1.3 to 521.1
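A degree day is conventionally the amount by which a day's mean temperature exceeds a base threshold, floored at zero, and the accumulated value is a running total that restarts each year. The sketch below illustrates that convention only. The base temperature, the use of the midpoint of daily maximum and minimum as the mean, and all names and toy values are assumptions, since this section does not state how the authors computed the variable.

```python
from itertools import groupby


def degree_day(tmax, tmin, base):
    """Degrees by which the daily mean exceeds `base`; never negative."""
    return max(0.0, (tmax + tmin) / 2 - base)


def accumulate_by_year(daily):
    """Running total of degree days that resets at each new year.

    `daily` is a list of (year, degree_day) tuples already sorted by date.
    """
    out = []
    for year, group in groupby(daily, key=lambda rec: rec[0]):
        total = 0.0
        for _, dd in group:
            total += dd
            out.append((year, total))
    return out


BASE = 60.0  # hypothetical base temperature, chosen only for illustration

# Toy observations (hypothetical): (year, daily max, daily min).
obs = [(2007, 80, 60), (2007, 62, 50), (2007, 90, 70), (2009, 75, 65), (2009, 58, 44)]
daily = [(year, degree_day(tmax, tmin, BASE)) for year, tmax, tmin in obs]
accumulated = accumulate_by_year(daily)
print(accumulated)
```

Resetting the total each year keeps the variable comparable across seasons, so the same accumulated value corresponds to a similar point in the warm season in any year.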
GROUPED LINE GRAPH | YEAR 2007
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus
GROUPED LINE GRAPH | YEAR 2009
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
GROUPED LINE GRAPH | YEAR 2011
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
GROUPED LINE GRAPH | YEAR 2013
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
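Each grouped line graph plots per-period averages of several series for one year. The aggregation behind such a plot might look like the sketch below, which averages by ISO week. The weekly granularity, the field names, and the toy rows are assumptions, since the source does not state the time unit on the horizontal axis.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_means(rows, year, fields):
    """Average each field per ISO week for the rows that fall in `year`."""
    buckets = defaultdict(list)
    for row in rows:
        if row["date"].year == year:
            week = row["date"].isocalendar()[1]
            buckets[week].append(row)
    return {
        week: {f: mean(r[f] for r in group) for f in fields}
        for week, group in sorted(buckets.items())
    }


# Toy trap records (hypothetical field names and values).
rows = [
    {"date": date(2007, 7, 2), "mosquitos": 4, "tavg": 70, "precip": 0.0, "wnv": 0},
    {"date": date(2007, 7, 5), "mosquitos": 10, "tavg": 76, "precip": 0.5, "wnv": 0},
    {"date": date(2007, 7, 9), "mosquitos": 20, "tavg": 80, "precip": 0.0, "wnv": 1},
    {"date": date(2009, 7, 6), "mosquitos": 7, "tavg": 72, "precip": 0.1, "wnv": 0},
]
series = weekly_means(rows, 2007, ["precip", "mosquitos", "tavg", "wnv"])
print(series)
```

Averaging the 0/1 virus indicator gives the share of positive records in each week, which is one way the presence of virus could be drawn on the same axes as the other averages.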
Works Cited
Abdi, Herve. Multivariate analysis. Retrieved from www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf
Andridge & Little. (2011). A review of hot deck imputation for survey non-response. Int Stat Rev. 78(1): 40-64. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/
Ezanno, P., Aubry-Kientz, M. et al. (2015). A generic weather driven model to predict mosquito population dynamics applied to species of Anopheles, Culex and Aedes genera of southern France. 120(1): 39-50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972
Kaggle. West Nile Prediction. Retrieved from https://www.kaggle.com/c/predict-west-nile-virus/data
Kuhn & Johnson. (2013). Applied Predictive Modeling. New York, Springer.
Natural Resources Management and Environmental Departments. Annex 4: Statistical Analysis of Weather Data Sets 1. Retrieved from http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage
Ruiz, Marilyn O., Luis F. Chaves et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors. Retrieved from http://www.parasitesandvectors.com/content/3/1/19
Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus illness and urban landscapes in Chicago and Detroit. International Journal of Health Geographics.
Sim, C.H., Gan, F.F. et al. (2005). Outlier labeling with boxplot procedures. Journal of the American Statistical Association, 100(470). Retrieved from http://www.jstor.org/stable/27590584
Theophilides, C.N., S.C. Ahearn et al. (2006). First evidence of West Nile virus amplification and relationship to human infections. International Journal of Geographical Information Science, 20, 103-115.