MULTIVARIATE ANALYSIS PREDICTS WEST NILE VIRUS USING WEATHER SPATIAL FACTORS

MULTIVARIATE ANALYSIS

FARZAD ESKANDANIAN, MAX LI, JOYCE ROSE, NASIM SONBOLI

CSC 424 | ADVANCED DATA ANALYSIS

6|14|2015

The purpose of this paper is to discuss the model(s) used in predicting the presence or absence of the West Nile virus [WNV]. The uniqueness of this multivariate analysis is the use of weather, temporal and spatial factors based on the premise of time-based effects. That is, the models built take into account the developmental stages of a mosquito. Four individual classifiers were built: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). The best combination of parameters from each model was included in the ensemble model. Species, week number, location, moving temperature averages, precipitation moving averages and growing degree days played an important role in predicting WNV. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.

INTRODUCTION

The West Nile virus (WNV) is “a mosquito borne disease-causing infectious agent” (Theophilides et al, 2006, para. 1) that affects birds, humans, and animals. In 1999, WNV was first reported in the United States. Since that initial occurrence, seasonal WNV epidemics have been recorded, leading to a series of research efforts focused on understanding the features and characteristics of the virus. The research available on WNV indicates that “the infections caused by pathogens by way of a mosquito vector often cluster in space and time given the habitat requirements of the vectors and the vertebrate involved in the transmission” (Ruiz et al, 2007, para 8).

In other words, West Nile viral transmission is attributed to patterns of climate, landscape, hydrology and types of human settlements. Ruiz et al (2010) argue that the statistical models built thus far by researchers are mere reports that only characterize associations between the virus and weather, landscape, human density etc. Though they offer insights about WNV, the associations themselves are not enough to develop and implement preventive measures for future epidemics. The interesting aspect of the WNV challenge arises from the need to build a better model that takes into account the life cycle of the mosquitoes in relationship to the variability in weather and its impact “on

WEST
NILE
VIRUS
|
CHICAGO


growth or activity of an organism.” Such a model can take a step beyond associations and indicate the best time and location for early intervention. The importance of building a robust model with predictive capabilities lies in the need to prevent future outbreaks. Therefore, the goal of this project is to build a model that uses weather, temporal and spatial factors to predict the West Nile virus.

DATA
DESCRIPTION

Kaggle’s West Nile Virus challenge consists of the following datasets (see footnote 1):

Dataset   Obs      Var
Train     10506    12
Weather   2944     22
Spray     14835    4
Test      116293   11

The datasets contain a combination of string and numeric variables.

“In many cases, some predictors have no values for a given sample. These missing data could be structurally missing” (Kuhn & Johnson, p.41). For instance, station 2 does not collect information on depart, depth, water1, snowfall, sunset and sunrise. These structurally missing values are denoted by “M”, “T”, or “-”. “In other cases, the value cannot or was not determined at the time of the model building” (Kuhn & Johnson, p.41). Examples of such missing values are tavg, wetbulb, heat, cool, preciptotal, stnpressure, sea level, time [584 values] and average speed. Hence, both the spray data and the weather data contain missing values.

The missing values for time in the spray dataset are “concentrated in a subset of predictors” (Kuhn & Johnson, p.41). In other words, the 584 missing values in the spray data relate to 09/07/2011, where time was not recorded after 7:44:32 PM and before 7:46:30 PM. The non-structurally missing values in the weather dataset, however, appear to occur randomly across all the predictors.

1 The fields for the datasets can be found in Table 1 in the appendix titled “Data Fields”.

The counts of missing values for each of the predictor variables have been tabulated below.

The response variable has two classes that the model aims to predict, namely the presence or absence of the West Nile virus [1, 0].

The explanatory variables are: maximum temperature, minimum temperature, average temperature, precipitation, result wind speed, result wind direction, species, trap, longitude, latitude, number of mosquitoes and address.

EXTERNAL
DATASETS

Although Kaggle already provides a number of explanatory variables for the West Nile Virus challenge, there are ample opportunities to include external datasets containing other variables that may improve a predictive model’s performance. For example, Ruiz et al (2010) found that the amount of vegetation and the degree to which water would flow or remain in an area mediated the effect of weather in predicting the infection rate of West Nile virus. Socioeconomic factors that measured poverty also seemed to correlate with the presence of West Nile virus. Bringing in additional data from reliable government sources that reflect the aforementioned factors will help us fine-tune our predictive models.

MULTIVARIATE
ANALYSIS

The main objective of a multivariate analysis is to use multiple data mining techniques to study how variables relate to one another. This method of analysis is most often used when the dataset contains more than one explanatory variable, more than one response variable, or both. Kaggle’s West Nile Virus dataset contains one response variable and 12 explanatory variables.

Using a multivariate analysis for such a dataset is desirable because the final outcome of accurately predicting the presence or absence of WNV might be influenced by more than one attribute. For instance, principal component analysis can be used to “decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables” (Abdi, p.1). Performing PCA will determine the dominant trends in the dataset, to which, for example, a logistic regression model can then be applied. Conducting a logistic regression alone with 12 explanatory variables may not produce a stable model if there is a strong dependence between predictors. PCA addresses the issue of multicollinearity, which helps the regression model estimate the response variable more reliably. Therefore, weighing the advantages and disadvantages of using one technique in conjunction with another, in light of the number of explanatory variables, motivates the use of multivariate analysis.

DATA
COLLECTION

The dataset provided by the Chicago Department of Public Health and NOAA [National Oceanic and Atmospheric Administration] comprises weather data (see footnote 2), GIS data (see footnote 3), date of traps set [spanning 3 days each week for approximately 5 months], location of traps and species for the years between 2007 and 2014. The main dataset is broken into two sets: the training dataset and the testing dataset. The training dataset reflects data points collected for the odd years: 2007, 2009, 2011 and 2013. The testing dataset consists of data points gathered for the even years: 2008, 2010, 2012 and 2014.

There are two central factors that serve as the premise for when and why the WNV data was collected. The first factor is weather. “It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet.” (Kaggle, information description, para. 9) Therefore, the dataset captures information about weather [from station 1, Chicago O’Hare International Airport, and station 2, Chicago Midway International Airport] only for the months of late May through early October. The second factor is the availability of data for the number of mosquitos trapped, location, species identified and the test results for the presence or absence of the West Nile virus. “Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week.” (Kaggle, information description, para. 3)

It is no coincidence that traps are only set out in late spring through early fall, when the weather is conducive to population growth in mosquitos. Identifying the location of the traps, the number of mosquitos trapped, the species, and the frequencies of each species infected or not infected with the virus, in conjunction with weather, is crucial in understanding where the next sporadic growth of mosquitos will occur. After all, the goal of the predictive model is to identify the presence or absence of WNV by predicting the occurrence and the rate of mosquito growth in one particular location over another given a set of weather conditions. Such predictions can be used by the City of Chicago and CPHD “to efficiently and effectively allocate resources” to control the population growth of mosquitos, which in turn prevents the transmission of the “potentially deadly virus.”

2 Weather data has been collected only for dates on which the traps were set.
3 GIS data for spraying is only available from 2011 to 2013.

DATA
MERGING

The West Nile training dataset does not contain the weather variables required for a robust analysis. Therefore, the weather dataset has been merged with the train file, resulting in a merged file titled “wnv.train.weather.” The keys used to merge both files are date and station. Since the NOAA weather dataset provides weather data from two weather stations located in the Greater Chicago Area, the distance from the site of each trap to each of the two weather stations was calculated and used to select the weather information from the nearer station for each training record. Two distance metrics were considered: 1) the Euclidean distance formula,

D = \sqrt{(lat_{station} - lat_{trap})^2 + (long_{station} - long_{trap})^2}

and 2) the Haversine formula (http://en.wikipedia.org/wiki/Haversine_formula), which takes into account the curvature of the Earth. The “geosphere” R package was used to calculate the Haversine distance.
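The station-selection step can be sketched in a few lines. The following is a minimal Python illustration, not the authors' R code (they used the “geosphere” package); the function names, the Earth radius constant and the approximate airport coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_station(trap_lat, trap_lon, stations):
    """Return the id of the station closest to the trap."""
    return min(
        stations,
        key=lambda sid: haversine_km(trap_lat, trap_lon, *stations[sid]),
    )

# Approximate coordinates, for illustration only (assumed values):
# station 1 = O'Hare, station 2 = Midway.
STATIONS = {1: (41.995, -87.933), 2: (41.786, -87.752)}
```

Each training record would then be joined to the weather row with the same date and the station id returned by `nearest_station`.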

NEW
FEATURES

Ruiz et al. (2010) reported the importance of temporal characteristics of weather in predicting infection rates of WNV in Northern Illinois. For example, they found a positive correlation at 1 to 3 week lags between precipitation and infection rates. Based on this research, new features were created to capture this information in the weather dataset, namely a 2 week moving average of precipitation as well as a 2 week moving sum of accumulated rainfall.
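A trailing moving average and moving sum of this kind might be computed as in the sketch below. It assumes one precipitation value per day in chronological order and a 14-day window for “2 weeks”; how the authors handled the first days of each season (where fewer than 14 values exist) is not stated, so the partial-window behavior here is an assumption.

```python
def trailing_windows(values, window):
    """Yield, for each position, the trailing slice of up to `window` values."""
    for i in range(len(values)):
        yield values[max(0, i - window + 1): i + 1]

def moving_sum(values, window):
    """Trailing sum; early positions use whatever history is available."""
    return [sum(w) for w in trailing_windows(values, window)]

def moving_average(values, window):
    """Trailing mean; early positions average over a shorter window."""
    return [sum(w) / len(w) for w in trailing_windows(values, window)]

# Hypothetical daily precipitation totals (inches).
daily_precip = [0.0, 0.5, 0.0, 1.0]
two_week_sum = moving_sum(daily_precip, window=14)
two_week_avg = moving_average(daily_precip, window=14)
```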

Also, time-based effects of temperature were explored. This entailed the use of a metric known as growing degree days (GDD), which measures heat accumulation and is used to predict mosquito development rates. GDD was calculated as

GDD = \begin{cases} T_{mean} - T_{base}, & \text{if } T_{mean} > T_{base} \\ 0, & \text{if } T_{mean} \le T_{base} \end{cases}

where Tbase represents a threshold temperature at which an organism’s growth rate is near zero. From reviewing the literature, Tbase can range between 13°C and 33°C. We will vary Tbase and observe the threshold value that yields the best performing model.
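As a small illustration of the GDD definition, the sketch below computes the daily value and a running total. The running total is an assumption about how the accumulated degree-day feature referred to later in the paper was built; the function names are hypothetical.

```python
def growing_degree_days(t_mean, t_base):
    """Daily GDD: excess of the mean temperature over the base, floored at zero."""
    return max(t_mean - t_base, 0.0)

def accumulated_gdd(daily_means, t_base):
    """Running total of daily GDD over a season (assumed accumulation scheme)."""
    total = 0.0
    running = []
    for t_mean in daily_means:
        total += growing_degree_days(t_mean, t_base)
        running.append(total)
    return running
```

Varying `t_base` over a grid of candidate thresholds and refitting the model for each value is one way to carry out the search described above.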

Other features created from the base training data include the week number of the year. The abundance of mosquitos, and consequently the presence of WNV, is expected to be more prevalent during certain times of the year. Therefore, it is surmised that the week number will be important in predicting the timing of WNV.

CATEGORICAL
VARIABLES

Dealing with categorical variables can pose certain limitations. For example, if a variable in a given dataset contains several categories, there arises a need to re-categorize the classes into fewer groups for the sake of simplicity and the robustness of the predictive model. In addition, depending on the data mining technique used, the need to use numerical rather than categorical data becomes evident.

The categorical variables found in the WNV dataset have undergone transformations in the form of re-categorization. For instance, the variable species is categorical with seven classes, as indicated in the table below:

Table 1 Species

However, Table 1 indicates that only 3 species have tested positive for WNV. Re-categorization highlights the importance of the three classes associated with WNV, leaving the other four classes to be grouped in a category of their own, indicative of their lack of contribution to the spread of WNV (see footnote 4). It is also important to note that the training set has a class titled “uncategorized.” By creating the fourth category, called “Culex Other”, the issue of the unidentified species is addressed effectively.

4 Table 2 titled Species 2 contains the new groupings.

The re-categorization approach has been applied to the variable date as well.
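The species re-grouping described in this section can be sketched as follows. This is an illustration under assumptions: the species labels in the usage are placeholders rather than the dataset's actual class names, and the rule “keep any species with at least one positive record” is inferred from the text rather than confirmed by it.

```python
def positive_species(records):
    """Species that have at least one WNV-positive record.

    `records` is an iterable of (species, wnv_present) pairs.
    """
    return {species for species, wnv_present in records if wnv_present == 1}

def regroup(species, keep, other_label="Culex Other"):
    """Keep species linked to WNV; fold every other label into one class."""
    return species if species in keep else other_label

# Placeholder labels, for illustration only.
records = [("species A", 1), ("species A", 0), ("species B", 0), ("species C", 1)]
keep = positive_species(records)
```

Because unknown labels fall through to the “Culex Other” class, species that appear only in the test data are handled without special cases.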

EXPLORATORY
DATA
ANALYSIS

One primary focus of an exploratory data analysis is to check whether the specific characteristics of a dataset meet the requirements of the modeling technique(s) to be used, as some models may be sensitive to certain types of data. That is, how is the dataset distributed?

Skewness of a distribution, whether positive or negative, is often the result of a “subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution” (Sim et al, 2005, pg.642). Histograms and box plots are graphical tools widely used to inspect the data for the presence of outliers. There are two important questions to address after visually inspecting the boxplot. First, is it possible for the boxplot to incorrectly declare certain points as outliers? Second, does the presence of outliers imply the need for a transformation?

The box plots (see footnote 5) for the West Nile dataset identify certain variables as skewed, with outliers present. For instance, the distribution of the number of mosquitos is right skewed, being pulled to the right by the largest values in that column. The IQR rule for outliers (see footnote 6) indicates that values lying below -20 and above 39.5 are potential outliers. On examining the number of mosquitos trapped for each species, it is apparent that class imbalance plays an important role in the skewness of the data, as shown in Table 2.

Table 2: Number of Mosquitos Trapped

All numbers above 39.5 represent the species attributed to WNV and the locations where they abound. There exists a pattern between the type of species, the location and the number of mosquitos trapped that is beyond the scope of the boxplot.

5 All histograms and box plots with short description of shape, center and spread for the WNV dataset can be found in the appendix.
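For reference, the IQR rule mentioned above flags values outside Q1 - 1.5·IQR and Q3 + 1.5·IQR. The sketch below uses Python's standard library; the quartile convention (`method="inclusive"`) is an assumption, and different conventions shift the fences slightly, so it need not reproduce the -20 and 39.5 bounds exactly.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper fences of the IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def potential_outliers(values, k=1.5):
    """Values lying strictly outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]
```

A value flagged this way is only a candidate outlier; as the section argues, a large mosquito count may reflect species and location rather than a data error.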

Similarly, the boxplots for most of the weather variables in the WNV dataset show the presence of outliers. However, yearly, monthly, weekly and daily variations in weather are considerable, and the differences in data points for stations 1 and 2 can be due to the geographical locations of the stations and/or the way in which the instruments record the temperatures.

The Natural Resources Management and Environment Department furthers this argument by stating that “weather data collected at a given weather station during a period of several years may be not homogeneous, i.e., the data set representing a particular weather variable may present a sudden change [from one weather station to another]. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others, which relate to modification of the environmental conditions of the site” or even “change in the time of the observations” (para.14).

6 The appendix contains a table titled “Lower and Upper Bound Outliers”.

Thus, the skewness of the distribution is not necessarily a consequence of extreme data points. Rather, it is a result of class imbalance. For instance, the histogram for the accumulated degree day shows that the distribution is skewed to the right. But when the histogram is constructed taking into consideration the presence or absence of WNV, it becomes clear that class imbalance is the root of the skewness, as seen in the histograms below:

The histograms show that there are no WNV-positive records (wnvpresent = 1) at the lowest and highest degree days. However, the histograms for acc.deg.day split by wnvpresent (0 or 1) appear flatter. In order to remove distribution skewness, the data points were replaced by their square roots, resulting in data that is better behaved than in its original units.


In addition to skewness, another factor that affects the predictive capability of a model is the presence of outliers. As noted earlier, the weather data contains outliers. “For a large dataset, removal of samples based on missing values is not a problem, assuming the missingness is not informative” (Kuhn & Johnson, 2013, p.41). However, a more robust way of handling missing information is imputation. “Imputation is layer of modelling where missing values are estimated based on other predictor variables. This amounts to a predictive model within a predictive model” (Kuhn & Johnson, 2013, p.42).

Missing values in the weather dataset have been addressed by hot deck imputation, where each missing value is replaced with an observed value from a similar unit. “An attractive feature of the hot deck imputation is that only plausible values can be imputed since values come from observed responses in the donor pool” (Andridge & Little, 2011, para. 3), which means that the imputed weather values are more likely to resemble the other data points than imputed averages would. The second advantage of using hot deck imputation is that the “method does not rely on model fitting for the variable to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric method such as regression imputation” (Andridge & Little, 2011, para. 3).
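One simple form of hot deck imputation for the two-station weather data is sketched below. The paper does not say how a “similar unit” was chosen, so the donor rule used here (the other station's observation on the same date) is an assumption, as are the field names and the set of missing-value markers.

```python
def hot_deck_impute(records, field, missing=("M", "-", None)):
    """Fill a missing `field` with the same-date value from the other station.

    `records` is a list of dicts with "date" and "station" keys. A new list
    is returned; records without a donor keep their missing marker.
    """
    observed = {
        (r["date"], r["station"]): r[field]
        for r in records
        if r[field] not in missing
    }
    result = []
    for r in records:
        r = dict(r)  # copy so the caller's data is not modified
        if r[field] in missing:
            donors = [
                value
                for (date, station), value in observed.items()
                if date == r["date"] and station != r["station"]
            ]
            if donors:
                r[field] = donors[0]
        result.append(r)
    return result
```

Because the donor value is an actual observation, the imputed value always lies within the range of recorded weather, which is the property the quoted passage highlights.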

CORRELATION
ANALYSIS

There are specific variables in the dataset that reveal interesting patterns, such as the number of mosquitos, temperature and precipitation.

The goal of the correlation analysis was to plot or capture a trend that would explain the relationship between the variables and the presence of the West Nile virus. Since the variables are on different scales, they were normalized using the Z score formula. In addition to normalizing the data, average values of the said variables were used in building the plots.
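The Z score normalization can be sketched as below. Whether the authors divided by the population or the sample standard deviation is not stated; the population form is assumed here.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]
```

After this step, weekly averages of precipitation, temperature and mosquito counts can share one vertical axis even though their original units differ.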

The plots pertain to weekly records captured for 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. Individual plots have been drawn for each year.

The blue line shows the average precipitation, the red line shows the average number of mosquitos, the green line shows the average temperature and the purple line shows the presence of the virus.

Figure 1: 2007

According to the line graph for the year 2007, a sudden decrease in temperature is followed by a decrease in mosquitos after week 35. Consequently, the average number of detected virus cases decreases.

It was also noted that the higher the temperature and the precipitation get, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus.

An interesting pattern was found between precipitation and the increase in the number of mosquitos. The number of mosquitos increases rapidly not during the week of high precipitation but in the week after. It appears that the virus infects the mosquitos once their numbers increase.

The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, suggesting that a large share of the mosquitos carry the virus even though the mosquito population is small.

Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots have captured similar trends.

Figure 2: 2009

Figure 3: 2011

Figure 4: 2013

The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model will likely rely on these features more than the others to predict WNV.

Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity, an issue to consider in the modeling process.

MODELS

Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm. It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV.

Therefore, the modeling process for this dataset will be broken into two parts. Part I will focus on determining how best to incorporate the available features into a classification model. Part II will focus on investigating and fine-tuning the specific classification algorithms to yield the best possible prediction.

Part I

Weather Data and Principal Component Analysis

Due to the number of weather attributes available to the researcher in the dataset, it becomes quite difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features will be correlated with one another, resulting in multicollinearity. For example, the amount of precipitation will be correlated with atmospheric pressure and, in turn, with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences of the original weather data while eliminating the detrimental effects that can result from the linear dependency of predictor variables.

Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first 5 components from PCA will be used to reflect the weather conditions of a specific day in the data.
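To make the “share of variation captured” idea concrete, the sketch below finds only the first principal component of a small table by power iteration on the covariance matrix and reports the fraction of total variance it explains. It is a dependency-free illustration, not the procedure the authors ran; a full analysis would standardize the weather columns and extract all components with a statistics library. All function names are hypothetical.

```python
import math

def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]

def top_component(cov, iters=1000):
    """Largest eigenvalue and its unit eigenvector via power iteration."""
    p = len(cov)
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cv[i] for i in range(p))
    return eigenvalue, v

def explained_ratio(rows):
    """Fraction of total variance captured by the first component."""
    cov = covariance(rows)
    eigenvalue, _ = top_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))
    return eigenvalue / total
```

For two perfectly correlated columns the first component captures all of the variance, which is the extreme case of the multicollinearity that PCA is used to remove here.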

Figure 5: PCA

Figure 6: Clustering

Figure 7: Model Summary

Temporally based weather variables and week number

While the weather conditions of a specific day can affect the activity level of mosquitos for that day, they do not take into account a mosquito’s life cycle or the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree day, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) will be included in the model. Also, week numbers of the year will be incorporated to capture the seasonal timing of mosquito populations within each year.

Clustering Location Data

Determining a good way to represent location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, these are believed not to be in a form conducive to predictive modeling due to the non-linear nature of spatial data. Thus, the k-means algorithm (k = 20) was used to translate the location data, represented by longitude/latitude pairs, into clustered locations. Figure 6 shows the location of the clusters using a normalized scale. As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations will be used as a categorical variable in our models.
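The clustering step can be illustrated with a plain k-means (Lloyd's algorithm) on coordinate pairs, as sketched below. The initialization, iteration limit and seed are assumptions made for the example; the authors' actual implementation and settings beyond k = 20 are not given.

```python
import random

def _squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=100, seed=0):
    """Cluster (x, y) points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: _squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

The resulting label for each trap (an integer from 0 to k - 1) is what would enter the models as the categorical location variable.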

Part II

With the necessary data pre-processing and variable transformations completed, the focus moved on to the construction of models to predict WNV. The overall approach was to build an ensemble, a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers from which it is built. The strategy was to consider four individual algorithms and build the best possible classifier out of each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle’s train dataset was split randomly, with 70% used as the training set and the remaining 30% held out as the test dataset.

Figure 7 is a summary of the best set-ups for each algorithm. Of all the individual models, GAM was clearly the best performing, with an AUC value of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
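The weighted-average ensemble and its evaluation by AUC can be sketched as follows. The inputs are assumed to be predicted probabilities of WNV presence from two already-fitted models; the function names are hypothetical, and the AUC is computed from its pairwise definition rather than with any particular library.

```python
def weighted_ensemble(p_gam, p_svm, w_gam=0.6, w_svm=0.4):
    """Blend two lists of predicted probabilities with fixed weights."""
    return [w_gam * a + w_svm * b for a, b in zip(p_gam, p_svm)]

def auc(labels, scores):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive case is scored
    higher than a randomly chosen negative case, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUC depends only on the ranking of scores, the blend can improve on both members when each model orders some cases better than the other.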


CONCLUSION

Although the ensemble model had the highest AUC value achieved on the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard.

In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were an ensemble of GAM logistic regression and GLM logistic regression, and a slightly modified Poisson GLM model. Neither had a notable training AUC, but both performed well on Kaggle.

Other validation techniques were investigated in an attempt to obtain better feedback from the training process and thereby build a better model. Instead of using a 70/30 training and testing split, a modified version of n-fold cross validation was used in which one year’s data was left out for testing and the remaining years were used for training. This process was repeated four times, once for each year, and the model’s performance was averaged across the four runs. The best models achieved from this validation technique did not seem any different from the models built on a traditional 70/30 split.
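The leave-one-year-out scheme described above can be sketched as below. The record layout (dicts with a "year" key) and the `fit`/`score` callables are placeholders for whatever model and metric are being validated.

```python
def leave_one_year_out(records, year_of=lambda r: r["year"]):
    """Yield (held_out_year, train_records, test_records) for each year."""
    years = sorted({year_of(r) for r in records})
    for held_out in years:
        train = [r for r in records if year_of(r) != held_out]
        test = [r for r in records if year_of(r) == held_out]
        yield held_out, train, test

def cross_validate(records, fit, score):
    """Average the score over all leave-one-year-out folds."""
    scores = [
        score(fit(train), test)
        for _, train, test in leave_one_year_out(records)
    ]
    return sum(scores) / len(scores)
```

Holding out a whole year mimics the competition setup more closely than a random 70/30 split, since the test years never appear in training.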

Figure 8: Models & Imbalance

Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see if the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and their relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance.

If using the appropriate validation technique does not account for the disparity between training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and testing data. Specifically, it is possible that there are idiosyncratic, year-specific variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al (2014) report that populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year.

It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data.

This intuitively makes sense, as most of the models that performed better on Kaggle tend to be simple models that included variables like location, week number and mosquito species, which are generalizable across all years of the data.


Another matter for consideration in future model building is the importance of the spray data. Though the spray data is not a part of the testing dataset, which would warrant an immediate dismissal from the predictor selection process, the following heat map implies otherwise. Upon close inspection of the heat map, one speculates that spraying one year does indeed alter the mosquito population the next year, which might explain why mosquito populations appear in different locations each year.

Also, feature engineering of the predictor variable depart [departure from normal] might help in creating a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from normal temperature as hotter than normal or colder than normal.
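The proposed categorization of depart could look like the sketch below. The category labels, the handling of a zero departure and the treatment of unparseable entries (such as the “M” marker used for missing values) are assumptions for illustration.

```python
def categorize_depart(depart):
    """Map a departure-from-normal temperature to a coarse category."""
    try:
        value = float(depart)
    except (TypeError, ValueError):
        return "unknown"  # missing markers or non-numeric entries
    if value > 0:
        return "hotter than normal"
    if value < 0:
        return "colder than normal"
    return "normal"
```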


Appendix

Table 3: Data Fields

Number  Train             Weather                Spray      Test
1       Date              Station                Date       ID
2       Address           Date                   Time       Date
3       Species           Max Temperature        Latitude   Address
4       Block             Min Temperature        Longitude  Species
5       Street            Avg Temperature                   Block
6       Trap              Departure from Normal             Street
7       Address Number    Dew Point                         Trap
8       Latitude          Wet Bulb                          Address Number
9       Longitude         Heat                              Latitude
10      Address Accuracy  Cool                              Longitude
11      # of Mosquitoes   Sunrise                           Address Accuracy
12      Wnvpresent        Sunset
13                        Code Sum
14                        Depth
15                        Water1
16                        Snowfall
17                        Total Precipitation
18                        Station Pressure
19                        Sea Level
20                        Wind Speed
21                        Wind Direction
22                        Average Speed

SKEWNESS OF VARIABLES & OUTLIERS
DATE PATTERN

The data is skewed to the
left. There are more records
for 2007 than other years
but not by a significant
amount. If this becomes
problematic, we may
sample equal number of
records for each year.

LATITUDE PATTERN

Shape: Latitude is very slightly skewed to the left. Mean is less than the median
Center: 41.84628
Spread: 41.64461 to 42.01743

LONGITUDE PATTERN

Shape: Longitude is symmetric
Center: -87.69499
Spread: -87.93099 to -87.53163

NUMBER OF MOSQUITOS PATTERN

Shape: The distribution is right skewed as the mean 12.85351 is pulled to the right, away from the median, which is 5
Center: 5
Spread: 1 to 50
Outlier: The boxplot confirms the skewness of the histogram in that there are large numbers causing the distribution to be pulled to the right. The outlier function indicates the largest number in the data for number of mosquitos is 50

DISTANCE FROM O’HARE PATTERN

Shape: The distribution is symmetric
Center: 0.2943334
Spread: 0.0372549 to 0.5179756

DISTANCE FROM MIDWAY PATTERN

Shape: The distribution is slightly skewed to the left as the mean 0.1548598 is pulled away from the median 0.1616137
Center: 0.1616137
Spread: 0.0077139 to 0.2481943

21

MAXIMUM TEMPERATURE PATTERN

Shape:
The
distribution
is
s
skewed

to
the
left
as
the
mean
81.94765
is

pulled
away
to
the
left
from
the

median
83

Center:
83

Spread: 57 to 97
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 57 is the
point that is distant from the other
values in the dataset.

MINIMUM TEMPERATURE PATTERN

Shape: The distribution is skewed to the left, as the mean (64.16533) is pulled to the left, away from the median (66).
Center: 66
Spread: 41 to 79

AVERAGE TEMPERATURE PATTERN

Shape: The distribution is skewed to the left, as the mean (38.28412) is pulled to the left, away from the median (40).
Center: 40
Spread: 15 to 52
Outlier: The box plot shows
the presence of some points

TOTAL PRECIPITATION PATTERN

Shape: The distribution is skewed to the right, as the mean (0.1274281) is pulled to the right, away from the median (0).
Center: 0
Spread: 0.00 to 3.97
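Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 3.97 is the point that is distant from the other values in the dataset.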

RESULT OF WIND SPEED PATTERN

Shape: The distribution is skewed to the right, as the mean (5.911003) is pulled to the right, away from the median (5.5).
Center: 5.5
Spread: 0.1 to 15.4

RESULT OF WIND DIRECTION PATTERN

Shape: The distribution is skewed to the left, as the mean (17.72016) is pulled to the left, away from the median (19).
Center: 19
Spread: 1 to 36

AVERAGE WIND SPEED PATTERN

Shape: The distribution is skewed to the left, as the mean (123.4147) is pulled to the left, away from the median (139).
Center: 139
Spread: 3 to 177

TEMPERATURE MOVING AVERAGES - 1 WEEK PATTERN

Shape: The distribution is skewed to the left, as the mean (72.5431) is pulled to the left, away from the median (73.14286).
Center: 73.14286
Spread: 53.14286 to 83.85714
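Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 53.14286 is the point that is distant from the other values in the dataset.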
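
For reference, a 1-week moving average of the kind summarized here could be computed as in the sketch below. It assumes a trailing 7-day window over daily temperatures; the window alignment, the function name `trailing_mean`, and the toy temperatures are assumptions rather than details taken from the analysis.

```python
from collections import deque


def trailing_mean(values, window):
    """Mean of the last `window` values; None until the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # value about to be pushed out of the window
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


# Hypothetical daily average temperatures (degrees F) for eight days.
temps = [70, 72, 74, 76, 78, 80, 82, 84]
weekly = trailing_mean(temps, window=7)
```

The same helper with `window=14` would give a 2-week average, and dropping the division by `window` would give the corresponding moving sum.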

TEMPERATURE MOVING AVERAGES – 2 WEEK PATTERN

Shape: The distribution is skewed to the left, as the mean (72.41439) is pulled to the left, away from the median (73).
Center: 73
Spread: 55.07143 to 82.76923
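Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 55.07143 is the point that is distant from the other values in the dataset.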

MOVING AVGS OF PRECIPITATION – 1 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.1333564) is pulled to the right, away from the median (0.07).
Center: 0.07
Spread: 0.0000 to 1.42857
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 1.42857 is the point that is distant from the other values in the dataset.

MOVING AVGS OF PRECIPITATION – 2 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.130) is pulled to the right, away from the median (0.085).
Center: 0.085
Spread: 0.0007 to 0.76714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 0.76714 is the point that is distant from the other values in the dataset.

MOVING SUM OF PRECIPITATION – 1 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.9432334) is pulled to the right, away from the median (0.53).
Center: 0.53
Spread: 0.000 to 9.149

MOVING SUM OF PRECIPITATION – 2 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (1.74216) is pulled to the right, away from the median (1.1).
Center: 1.1
Spread: 0.000 to 10.74999
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 10.75 is the point that is distant from the other values in the dataset.

DEGREE DAY PATTERN

Shape: The distribution is skewed to the right, as the mean (3.824472) is pulled to the right, away from the median (3.4).
Center: 3.4
Spread: 0.0 to 14.9

ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN

Shape: The distribution is skewed to the right, as the mean (241.0934) is pulled to the right, away from the median (239.6).
Center: 239.6
Spread: 1.3 to 521.1
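
An accumulated degree day that restarts each year can be derived from the daily degree-day values with a running sum per year. The sketch below is illustrative only: it assumes the input is a date-ordered list of `(year, degree_day)` pairs, and the function name and toy values are hypothetical.

```python
from itertools import accumulate, groupby


def accumulate_by_year(daily):
    """Running total of degree days that restarts at each new year.

    `daily` is a list of (year, degree_day) pairs already sorted by date.
    """
    out = []
    for year, group in groupby(daily, key=lambda row: row[0]):
        degree_days = [row[1] for row in group]
        out.extend((year, total) for total in accumulate(degree_days))
    return out


# Hypothetical daily degree days for parts of two seasons.
daily = [(2007, 1.5), (2007, 2.0), (2007, 0.0), (2009, 3.5), (2009, 1.0)]
accumulated = accumulate_by_year(daily)
```

Because the total resets at each year boundary, values from different seasons stay comparable as a measure of heat accumulated so far in that season.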


LOWER & UPPER BOUND OUTLIERS
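
Lower and upper bounds for boxplot outlier labeling are commonly taken as the quartiles minus or plus 1.5 times the interquartile range. The sketch below assumes that conventional rule and the "inclusive" quartile method of Python's `statistics.quantiles`. The multiplier, the quartile method, and the toy data are assumptions, not values taken from this analysis.

```python
from statistics import quantiles


def boxplot_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def label_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = boxplot_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical mosquito counts with one large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
bounds = boxplot_bounds(data)
outliers = label_outliers(data)
```

Different quartile definitions shift the fences slightly, so the points flagged here may not match a given plotting package exactly.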


GROUPED LINE GRAPH | YEAR 2007

Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus

Green line: The average temperature. Purple line: The presence of virus.


Works Cited
Abdi, Herve. Multivariate analysis. Retrieved from
www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf
Andridge & Little. (2011). A review of hot deck imputation for survey non-response.
Int Stat Rev. 78(1): 40-64. Retrieved from
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/
Ezanno, P., Aubry-Kientz, M. et al. (2015). A generic weather driven model to predict
mosquito population dynamics applied to species of Anopheles, Culex
and Aedes genera of southern France. Preventive Veterinary Medicine, 120(1): 39-50.
Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972
Kaggle. West Nile Prediction. Retrieved from
https://www.kaggle.com/c/predict-west-nile-virus/data
Kuhn & Johnson. (2013). Applied Predictive Modeling. New York, Springer.
Natural Resources Management and Environmental Departments. Annex 4:
Statistical Analysis of Weather Data Sets 1. Retrieved from
http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage
Ruiz, Marilyn O., Luis F. Chaves et al. (2010). Local impact of temperature and
precipitation on West Nile virus infection in Culex species mosquitoes
in northeast Illinois, USA. Parasites & Vectors. Retrieved from
http://www.parasitesandvectors.com/content/3/1/19
Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus
illness and urban landscapes in Chicago and Detroit. International
Journal of Health Geographics.
Theophilides, C.N., S.C. Ahearn et al. (2006). First evidence of West Nile virus
amplification and relationship to human infections. International
Journal of Geographical Information Science, 20, 103-115.
Sim, C.H., Gan, F.F. et al. (2005). Outlier labeling with boxplot procedures.
Journal of the American Statistical Association, 100(470).
Retrieved from http://www.jstor.org/stable/27590584
