MULTIVARIATE ANALYSIS
WEST NILE VIRUS | CHICAGO

FARZAD ESKANDANIAN, MAX LI, JOYCE ROSE, NASIM SONBOLI

CSC 424 | ADVANCED DATA ANALYSIS

6|14|2015
  
	
  
	
  
	
  
The purpose of this paper is to discuss the model(s) used in predicting the presence or absence of the West Nile virus [WNV]. The uniqueness of this multivariate analysis is the use of weather, temporal and spatial factors based on the premise of time-based effects. That is, the models built take into account the developmental stages of a mosquito. Four individual classifiers were built: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). The best combination of parameters from each model was included in the ensemble model. Species, week number, location, moving temperature averages, precipitation moving averages and growing degree days played an important role in predicting WNV. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
  
	
  
	
  
INTRODUCTION	
  
	
  
	
  
The West Nile virus (WNV) is “a mosquito borne disease-causing infectious agent” (Theophilides et al, 2006, para. 1) that affects birds, humans, and animals. In 1999, WNV was first reported in the United States. Since that initial occurrence, seasonal epidemics caused by WNV have been recorded, leading to a series of research efforts focused on understanding the features and characteristics of the virus. The research available on WNV indicates that “the infections caused by pathogens by way of a mosquito vector often cluster in space and time given the habitat requirements of the vectors and the vertebrate involved in the transmission.” (Ruiz et al, 2007, para 8).
  	
  
In other words, West Nile viral transmission is attributed to patterns of climate, landscape, hydrology and types of human settlements. Ruiz et al (2010) argue that the statistical models built thus far by researchers are mere reports that only characterize associations between the virus and weather, landscape, human density etc. Though they offer insights about WNV, the associations themselves are not enough to develop and implement preventive measures for future epidemics. The interesting aspect of the WNV challenge arises from the need to build a better model that takes into account the life cycle of the mosquitoes in relation to the variability in weather and its impact “on growth or activity of an organism.” Such a model can take a step beyond associations and indicate the best time and location for early intervention. The importance of building a robust model with predictive capabilities lies in the need to prevent an outbreak in the future. Therefore the goal of this project is to build a model that uses weather, temporal and spatial factors to predict the West Nile virus.
  	
  
	
  
DATA DESCRIPTION

Kaggle’s West Nile Virus challenge consists of the following datasets¹:

       Train    Weather   Spray    Test
Obs    10506    2944      14835    116293
Var    12       22        4        11
	
  	
  
The datasets contain a combination of string and numeric variables.
  	
  
	
  
“In many cases, some predictors have no values for a given sample. These missing data could be structurally missing” (Kuhn & Johnson, p.41). For instance, station 2 does not collect information on depart, depth, water1, snowfall, sunset and sunrise. These structurally missing values are denoted by “M”, “T”, or “-”. “In other cases, the value cannot or was not determined at the time of the model building” (Kuhn & Johnson, p.41). Examples of such missing values are tavg, wetbulb, heat, cool, preciptotal, stnpressure, sea level, time [584 values] and average speed. Hence, both the spray data and the weather data contain missing values.
  	
  
	
  
The missing values for time in the spray data set are “concentrated in a subset of predictors” (Kuhn & Johnson, p.41). In other words, the 584 missing values in the spray data all relate to 09/07/2011, where time has not been recorded after 7:44:32 PM and before 7:46:30 PM. The non-structurally missing data values for the weather dataset, however, appear to occur randomly across all the predictors. The counts of missing values for each of the predictor variables have been tabulated below.

¹ The fields for the datasets can be found in Table 1 in the appendix titled “Data Fields”.
  
	
  
	
  
	
  
	
  
The response variable has two classes that the model aims to predict, namely the presence or absence of the West Nile Virus [1, 0].

The explanatory variables are: maximum temperature, minimum temperature, average temperature, precipitation, result wind speed, result wind direction, species, trap, longitude, latitude, number of mosquitoes and address.
  	
  
EXTERNAL DATASETS

Although Kaggle already provides a number of explanatory variables for the West Nile Virus challenge, there are ample opportunities to include external datasets that may contain other variables that can improve a predictive model’s performance. For example, Ruiz et al (2010) found that the amount of vegetation and the degree to which water would flow or remain in an area mediated the effect of weather in predicting the infection rate of West Nile Virus. Socioeconomic factors that measured poverty also seemed to correlate with the presence of West Nile Virus. Bringing in additional data from reliable government sources that reflect the aforementioned factors will help us fine-tune our predictive models.
  	
  
MULTIVARIATE ANALYSIS

The main objective of a multivariate analysis is to use multiple data mining techniques to study how variables relate to one another. This method of analysis is most often used when the dataset contains more than one explanatory variable, more than one response variable, or both. Kaggle’s West Nile Virus dataset contains one response variable and 12 explanatory variables.

Using a multivariate analysis for such a dataset is desirable because the final outcome of accurately predicting the presence or absence of WNV might be influenced by more than one attribute. For instance, principal component analysis (PCA) can be used to “decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables” (Abdi, p.1). Performing PCA will determine the dominant trends in the dataset, upon which, for example, a logistic regression model can be applied.

Conducting a logistic regression alone with 12 explanatory variables may not produce a stable model if there is a strong dependence between predictors. PCA addresses the issue of multicollinearity by replacing correlated predictors with uncorrelated components, which can result in a more stable regression model. Therefore, weighing the advantages and disadvantages of using one technique in conjunction with another, in light of the number of explanatory variables, offers a reason to use multivariate analysis.
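The decorrelating effect of PCA described above can be illustrated with a minimal Python sketch for just two predictors. This is not the code used in the project; the function names and the temperature values are hypothetical, and the closed-form rotation only applies to the two-variable case.

```python
import math


def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def pca_2d(xs, ys):
    """Rotate two centered variables onto their principal axes.

    Returns (pc1, pc2): the scores on the first and second
    principal components, which are uncorrelated by construction.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle that diagonalizes the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2


# Hypothetical, strongly correlated daily maximum and minimum temperatures.
tmax = [88, 85, 79, 92, 70, 75]
tmin = [65, 63, 58, 70, 50, 55]
pc1, pc2 = pca_2d(tmax, tmin)
```

The component scores could then be passed to a logistic regression in place of the original, correlated predictors.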
  	
  
	
  
DATA COLLECTION

The dataset provided by the Chicago Department of Public Health and NOAA [National Oceanic and Atmospheric Administration] comprises weather data², GIS data³, dates on which traps were set [spanning 3 days each week for approximately 5 months], location of traps and species for the years between 2007 and 2014. The main dataset is split into two sets: the training dataset and the testing dataset. The training dataset reflects data points collected for the odd years: 2007, 2009, 2011 and 2013. The testing dataset consists of data points gathered for the even years: 2008, 2010, 2012 and 2014.
  	
  
	
  
There are two central factors that serve as the premise for when and why the WNV data was collected. The first factor is weather. “It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet.” (Kaggle, information description, para. 9) Therefore, the dataset captures information about weather [from station 1 (Chicago O’Hare International Airport) and station 2 (Chicago Midway International Airport)] only for the months of late May through early October. The second factor is the availability of data for the number of mosquitos trapped, location, species identified and the test results of the presence or absence of the West Nile virus. “Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week.” (Kaggle, information description, para. 3)
  
	
  
It is no coincidence that traps are only set out in late spring through early fall, when the weather is conducive to population growth in mosquitos. Identifying the location of the traps, the number of mosquitos trapped, the species, and the frequencies of each species infected or not infected with the virus, in conjunction with weather, is crucial in understanding where the next sporadic growth of the mosquitos will occur. After all, the goal of the predictive model is to identify the presence or absence of WNV by predicting the occurrence and the rate of mosquito growth in one particular location over another given a set of weather conditions. Such predictions can be used by the City of Chicago and CPHD “to efficiently and effectively allocate resources” to control the population growth of mosquitos, which in turn prevents the transmission of the “potentially deadly virus.”

² Weather data has been collected only for dates on which the traps were set.
³ GIS data for spraying is only available from 2011 to 2013.
  
	
  
DATA MERGING

The West Nile training dataset does not contain the weather variables required for a robust analysis. Therefore, the weather dataset has been merged with the train file, resulting in a merged file titled “wnv.train.weather.” The keys used to merge the two files are date and station.
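The merge on date and station can be sketched in Python as a simple keyed join. This is an illustration only, not the project's code: the function name, the dictionary-based record layout and the choice of an inner join (dropping unmatched trap records) are assumptions, and it presumes that every training record has already been assigned a station.

```python
def merge_on_date_station(train_rows, weather_rows):
    """Inner-join trap records with weather records on (date, station).

    Both inputs are lists of dicts that carry "date" and "station" keys.
    Trap records without a matching weather record are dropped.
    """
    weather_by_key = {(w["date"], w["station"]): w for w in weather_rows}
    merged = []
    for row in train_rows:
        match = weather_by_key.get((row["date"], row["station"]))
        if match is None:
            continue  # no weather observation for this date and station
        combined = dict(match)   # start from the weather columns
        combined.update(row)     # trap columns win on any name clash
        merged.append(combined)
    return merged
```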
  	
  
	
  
Since the NOAA Weather dataset provides weather data from two weather stations located in the Greater Chicago Area, the distance from the site of each trap to each of the two weather stations was calculated and used to select the weather information of the nearer station for each training record. Two distance metrics were considered: 1) the Euclidean distance formula,

D = \sqrt{(lat_{station} - lat_{trap})^2 + (long_{station} - long_{trap})^2}

as well as 2) the Haversine formula (http://en.wikipedia.org/wiki/Haversine_formula), which takes into account the curvature of the Earth.

The “geosphere” R package was used to calculate the Haversine distance.
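For readers without R, the two distance metrics and the nearest-station selection can be sketched in Python. The project itself used the “geosphere” package, so this is a stand-in rather than the authors' code. The station coordinates and the Earth radius are assumptions that do not appear in this paper, and all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

# Approximate (latitude, longitude) of the two stations; assumed values.
STATIONS = {
    1: (41.995, -87.933),  # Chicago O'Hare International Airport
    2: (41.786, -87.752),  # Chicago Midway International Airport
}


def euclidean_deg(lat1, lon1, lat2, lon2):
    """Straight-line distance in degrees (ignores the Earth's curvature)."""
    return math.hypot(lat1 - lat2, lon1 - lon2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, distance=haversine_km):
    """Return the id of the station closest to a trap location."""
    return min(STATIONS, key=lambda sid: distance(lat, lon, *STATIONS[sid]))
```

Over an area as small as Chicago the two metrics will often select the same station, so the choice mainly matters for traps that lie roughly midway between the two airports.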
  	
  
	
  
NEW FEATURES

Ruiz et al. (2010) reported the importance of temporal characteristics of weather in predicting infection rates of WNV in Northern Illinois. For example, they found a positive correlation at 1 to 3 week lags between precipitation and infection rates. Based on this research, new features were created to capture this information in the weather dataset, namely a 2 week moving average of precipitation as well as a 2 week moving sum of accumulated rainfall.
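A trailing-window version of these two features might look as follows in Python. This is a minimal sketch, assuming one precipitation value per day and a 14-day window. The function names are hypothetical, and the handling of the first few days (using a shorter, partial window) is an assumption rather than something the paper specifies.

```python
def trailing_window(values, window, reducer):
    """Apply `reducer` to each trailing window of up to `window` values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(reducer(chunk))
    return out


def moving_average(values, window=14):
    """Trailing moving average; early entries use a partial window."""
    return trailing_window(values, window, lambda c: sum(c) / len(c))


def moving_sum(values, window=14):
    """Trailing moving sum; early entries use a partial window."""
    return trailing_window(values, window, sum)
```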
  	
  
	
  
Also, time-based effects of temperature were explored, and this entailed the use of a metric known as growing degree days (GDD), a measure of heat accumulation used to predict mosquito development rates. GDD was calculated as

GDD = \begin{cases} T_{mean} - T_{base}, & \text{if } T_{mean} > T_{base} \\ 0, & \text{if } T_{mean} \le T_{base} \end{cases}

where T_{base} represents a threshold temperature at which an organism’s growth rate is near zero. From reviewing the literature, T_{base} can range between 13°C and 33°C. We will vary T_{base} and observe the threshold value that yields the best performing model.
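The GDD definition translates directly into code. The Python sketch below is illustrative only: the function names are hypothetical, temperatures are assumed to be daily means in °C, and the running total is one plausible way to obtain an accumulated degree-day feature.

```python
def growing_degree_days(t_mean, t_base):
    """Daily GDD: heat above the base temperature, never negative."""
    return t_mean - t_base if t_mean > t_base else 0.0


def accumulated_gdd(daily_means, t_base):
    """Running total of daily GDD over a sequence of mean temperatures."""
    total = 0.0
    out = []
    for t in daily_means:
        total += growing_degree_days(t, t_base)
        out.append(total)
    return out


# Varying the base temperature yields one candidate feature per threshold.
daily_means = [10.0, 15.0, 20.0, 25.0]  # hypothetical values in degrees C
candidates = {t_base: accumulated_gdd(daily_means, t_base) for t_base in (13, 18, 23)}
```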
  	
  
	
  
Other features that were created from the base training data include the specific week number of a year. The abundance of mosquitos and, consequently, the presence of WNV are expected to be more prevalent during certain times of the year. Therefore it is surmised that the week number will be important in predicting the timing of WNV.
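Deriving the week number from a date is a one-line operation in most languages. The Python sketch below assumes ISO 8601 week numbering and dates formatted as YYYY-MM-DD. Both are assumptions, since the paper does not state which week convention was used.

```python
from datetime import date


def week_of_year(iso_date):
    """Return the ISO 8601 week number for a 'YYYY-MM-DD' date string."""
    return date.fromisoformat(iso_date).isocalendar()[1]
```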
  	
  
	
  
CATEGORICAL VARIABLES

Dealing with categorical variables can pose certain limitations. For example, if a variable in a given data set contains several categories, there arises a need to re-categorize the classes into smaller groups for the sake of simplicity and the robustness of the predictive model. In addition, depending on the data mining technique used, the need to use numerical rather than categorical data becomes evident.
  	
  	
  	
  
	
  
The categorical variables found in the WNV dataset have undergone transformations in the form of re-categorization. For instance, the variable species is categorical with seven classes, as indicated in the table below:

Table 1 Species

However, Table 1 indicates that only 3 species have tested positive for WNV. Re-categorization highlights the importance of the three classes associated with WNV, leaving the other four classes to be grouped in a category of their own, indicative of their lack of attribution to the spread of WNV⁴. It is also important to note that the training set has a class titled “uncategorized.” By creating the fourth category called “Culex Other”, the issue of the unidentified species is addressed effectively.

The re-categorization approach has been applied to the variable date as well.

⁴ Table 2 titled Species 2 contains the new groupings.
  
	
  
EXPLORATORY DATA ANALYSIS

One of the primary aims of an exploratory data analysis is to check whether the specific characteristic(s) of a data set meet the requirements of the modeling technique(s) to be used, as some models may be sensitive to certain types of data. That is, how is the data set distributed?
  
	
  
Skewness of a distribution, whether positive or negative, is often a result of a “subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution.” (Sim et al, 2005, pg.642). Histograms and box plots are graphical tools widely used to inspect the data for the presence of outliers. There are two important questions to address after visually inspecting the boxplot. First, is it possible for the boxplot to incorrectly declare certain points as outliers? Second, does the presence of outliers imply the need for a transformation?
  	
  	
  	
  
The box plots⁵ for the West Nile dataset identify certain variables as skewed, with outliers present. For instance, the distribution of the number of mosquitos is right skewed: the distribution is pulled to the right by the largest values in that column. The IQR⁶ rule for outliers indicates that values lying below -20 and above 39.5 are potential outliers. On examining the number of mosquitos trapped for each species, it is apparent that class imbalance plays an important role in the skewness of the data, as shown in Table 2.

Table 2: Number of Mosquitos Trapped

⁵ All histograms and box plots, with a short description of shape, center and spread, for the WNV data set can be found in the appendix.
  
All counts above 39.5 belong to the species attributed to WNV and to the locations where those species abound. There exists a pattern between the type of species, the location and the number of mosquitos trapped that is beyond the scope of the boxplot.
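The IQR rule used to flag potential outliers can be sketched in Python as follows. The conventional multiplier of 1.5 and the “inclusive” quartile method are assumptions, since the paper does not say how its quartiles were computed. The function names and the sample counts are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def potential_outliers(values, k=1.5):
    """Values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical mosquito counts with one unusually large catch.
counts = [1, 2, 3, 4, 5, 6, 7, 8, 50]
flagged = potential_outliers(counts)
```

For a count variable such as the number of mosquitos, a negative lower fence simply means that no value can be flagged on the low side.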
  
	
  
Similarly, the boxplots for most of the weather variables in the WNV dataset show the presence of outliers. However, yearly, monthly, weekly and daily variations in weather are endless, and the differences in data points for stations 1 and 2 can be due to the geographical locations of the stations and/or the way in which the instruments record the temperatures.
  	
  
	
  
The Natural Resources Management and Environment Department furthers this argument by stating that “weather data collected at a given weather station during a period of several years may be not homogeneous, i.e., the data set representing a particular weather variable may present a sudden change [from one weather station to another]. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others, which relate to modification of the environmental conditions of the site” or even “change in the time of the observations.” (para.14)

⁶ The appendix contains a table titled “Lower and Upper Bound Outliers”.
  
	
  
Thus, the skewness of the distribution is not necessarily a consequence of extreme data points. Rather, it is a result of class imbalance. For instance, the histogram for the accumulated degree day shows that the distribution is skewed to the right. But when the histogram is constructed taking into consideration the presence or absence of WNV, it becomes clear that class imbalance is the root of the skewness, as seen in the histograms below:
  
	
  
	
  
	
  
	
  
The histograms show no WNV-positive records (wnvpresent = 1) at the lowest and highest degree days. However, the histograms for acc.deg.day when wnvpresent = 0, or when both classes (1 and 0) are combined, appear flatter. To reduce the skewness of the distribution, each data point was replaced by its square root, resulting in data that are better behaved than in their original units.
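The effect of the square-root transformation on skewness can be checked numerically. The Python sketch below uses the moment-based skewness coefficient, and the sample values are hypothetical rather than taken from the WNV data.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sqrt_transform(values):
    """Replace each non-negative value by its square root."""
    return [math.sqrt(v) for v in values]


raw = [1, 1, 2, 3, 4, 5, 8, 13, 50]  # hypothetical right-skewed values
before = skewness(raw)
after = skewness(sqrt_transform(raw))
```

The transformation compresses large values more than small ones, so a long right tail is pulled in. It reduces skewness but does not necessarily remove it.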
  	
  	
  
 
	
  
In addition to skewness, another factor that affects the predictive capability of a model is the presence of outliers. As noted earlier, the weather data contains outliers as well as missing values. “For a large dataset, removal of samples based on missing values is not a problem, assuming the missingness is not informative” (Kuhn & Johnson, 2013, p.41). However, a more robust way of handling missing information is by imputation. “Imputation is [a] layer of modelling where missing values are estimated based on other predictor variables. This amounts to a predictive model within a predictive model” (Kuhn & Johnson, 2013, p.42).
  	
  
	
  
Missing values in the weather data set have been addressed by implementing hot deck imputation, where each missing value is replaced with an observed value from a similar unit. “An attractive feature of the hot deck imputation is that only plausible values can be imputed since values come from observed responses in the donor pool” (Andridge & Little, 2011, para. 3), which means that the imputed weather values are more likely to resemble the other data points than imputed averages would be. The second advantage of using hot deck imputation is that the “method does not rely on model fitting for the variable to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric method such as regression imputation” (Andridge & Little, 2011, para. 3).
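One simple way to realize hot deck imputation for the weather records is sketched below in Python. The donor rule is an assumption made for illustration and is not described in the paper: the donor is the other station's observation on the same date, falling back to the same station's most recent observed value. The function name and record layout are hypothetical. The missing-value codes follow the ones listed earlier in this paper.

```python
MISSING_CODES = {"M", "T", "-"}  # codes the paper lists as missing


def hot_deck_impute(rows, field):
    """Fill missing `field` values from similar, observed records.

    `rows` is a date-ordered list of dicts with "date", "station"
    (1 or 2) and `field`. Returns new dicts; inputs are not modified.
    A value stays None when no donor is available.
    """
    observed = {
        (r["date"], r["station"]): r[field]
        for r in rows
        if r[field] not in MISSING_CODES
    }
    last_seen = {}  # most recent observed value per station
    filled = []
    for r in rows:
        r = dict(r)
        if r[field] in MISSING_CODES:
            other = 2 if r["station"] == 1 else 1
            donor = observed.get((r["date"], other))
            if donor is None:
                donor = last_seen.get(r["station"])
            r[field] = donor
        else:
            last_seen[r["station"]] = r[field]
        filled.append(r)
    return filled
```

Every imputed value is copied from an actual observation, which is the property the quotation above highlights.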
  
	
  
CORRELATION ANALYSIS

There are specific variables in the dataset that reveal interesting patterns, such as the number of mosquitos, temperature and precipitation.

The goal of the correlation analysis was to plot or capture a trend that would explain the relationship between the variables and the presence of the West Nile Virus. Since the variables are on different scales, they were normalized using the Z score formula. In addition to normalizing the data, average values of the said variables were considered in building the plots.
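Z-score normalization subtracts the mean and divides by the standard deviation, so variables measured in different units can be drawn on a common axis. A minimal Python sketch follows. The use of the population standard deviation is an assumption, and the function name is hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values]
```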
  
	
  
The plots pertain to weekly records captured for 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. Individual plots have been drawn for each year.

The blue line shows the average precipitation, the red line shows the average number of mosquitos, the green line shows the average temperature and the purple line shows the presence of the virus.
  	
  	
  	
  
	
  
	
  
Figure 1: 2007
  
According to the line graph for the year 2007, a sudden decrease in temperature is followed by a decrease in the number of mosquitos after week 35. Consequently, the average number of virus detections decreases.
  	
  
	
  
It was also noted that the higher the temperature and the precipitation, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus.
  
	
  
	
  
Figure 2: 2009

An interesting pattern was found between precipitation and the increase in the number of mosquitos. The number of mosquitos increases rapidly not during the week of high precipitation but in the week after. It appears that the virus infects the mosquitos once their numbers have increased.
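The one-week delay described above can be checked with a lagged correlation. The sketch below is illustrative only: the weekly series are invented to mimic the pattern in the plots, and the helper names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def lagged_correlation(driver, response, lag):
    """Correlate driver[t] with response[t + lag]; lag is in weeks."""
    if lag == 0:
        return pearson(driver, response)
    return pearson(driver[:-lag], response[lag:])

# Hypothetical weekly averages: mosquito counts jump the week AFTER rain.
precip = [0.0, 1.2, 0.1, 0.0, 0.9, 0.0, 0.1, 0.0]
mosquitos = [4.0, 5.0, 16.0, 6.0, 5.0, 14.0, 5.0, 6.0]

same_week = lagged_correlation(precip, mosquitos, 0)
next_week = lagged_correlation(precip, mosquitos, 1)
```

A clearly higher correlation at lag 1 than at lag 0 would support the reading that mosquito counts respond to the previous week's rainfall.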
  	
  
	
  
The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, indicating that the mosquitos that remain carry the virus even though the mosquito population is small.
  	
  
	
  
Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots have captured similar trends.
  
	
  
	
  
Figure 3: 2011

Figure 4: 2013
  
	
  
The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model is expected to rely on these features more than the others to predict WNV.
  	
  
	
  
	
  
	
  
Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity, an issue to consider in the modeling process.
  	
  
	
  
MODELS	
  
	
  
Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm. It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV.
  	
  
	
  
Therefore, the modeling process for this data set will be broken into two parts. Part I will focus on determining how to best incorporate the available features into a classification model. Part II will focus on investigating and fine tuning the specific classification algorithms to yield the best possible prediction.
  	
  	
  
	
  
Part I
  
	
  
Weather Data and Principal Component Analysis
  
	
  
Due to the number of weather attributes available in the dataset, it is quite difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features will be correlated with one another, resulting in multicollinearity. For example, the amount of precipitation will be correlated with atmospheric pressure and, in turn, with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences of the original weather data while eliminating the detrimental effects that can result from the linear dependency of predictor variables.
  	
  	
  
	
  
Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first 5 components from PCA will be used to reflect the weather conditions of a specific day in the data.
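To illustrate why PCA helps with correlated weather attributes, the sketch below works through the two-variable case, where the principal components follow from the eigenvalues of a 2x2 covariance matrix. It is a toy example under stated assumptions: the temperature readings are hypothetical, only two attributes are used (the study applied PCA to the full set of weather attributes), and the variables are left unscaled because they share the same unit.

```python
from math import sqrt
from statistics import mean

def explained_variance_2d(xs, ys):
    """Share of total variance captured by each principal component of (xs, ys)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    root = sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Hypothetical daily maximum and minimum temperatures (strongly collinear).
tmax = [70.0, 75.0, 80.0, 85.0, 90.0]
tmin = [52.0, 56.0, 61.0, 64.0, 70.0]

pc1_share, pc2_share = explained_variance_2d(tmax, tmin)
```

When two attributes move together, nearly all of the variance lands on the first component, so one uncorrelated feature can stand in for both.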
  	
  	
  	
  
	
  
	
  
Figure 5: PCA

Figure 6: Clustering

Figure 7: Model Summary
  
	
  
Temporally based weather variables and week number
  
	
  
While the weather conditions of a specific day can affect the activity level of mosquitos for that day, they do not take into account a mosquito’s life-cycle or the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree day, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) will be included in the model.
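The engineered features named above can be sketched as follows. This is a hedged illustration rather than the study's code: the base temperature of 50 degrees Fahrenheit is an assumption, the function names are hypothetical, and the window is expressed in days (7 for a one-week window, 14 for a two-week window).

```python
def growing_degree_days(tmax, tmin, base=50.0):
    """Daily growing degree days: mean temperature above an assumed base, floored at 0."""
    return max(0.0, (tmax + tmin) / 2.0 - base)

def trailing_mean(values, window):
    """Mean of each value and up to window - 1 values before it."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def trailing_sum(values, window):
    """Sum of each value and up to window - 1 values before it."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

# Hypothetical daily readings.
daily_tavg = [68.0, 70.0, 74.0, 77.0, 73.0, 69.0, 71.0, 75.0]
daily_precip = [0.0, 0.3, 0.0, 0.0, 1.1, 0.2, 0.0, 0.0]

temp_ma_1wk = trailing_mean(daily_tavg, window=7)
precip_sum_1wk = trailing_sum(daily_precip, window=7)
gdd_today = growing_degree_days(tmax=84.0, tmin=62.0)
```

Only past values enter each window, so a feature for a given day never uses weather recorded after that day.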
  	
  
	
  
Also, week numbers of the year will be incorporated to capture the intra-annual (seasonal) timing of mosquito populations.
  	
  
	
  
Clustering Location Data
  
	
  
Determining a good way to represent location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, these are believed not to be in a form conducive to predictive modeling due to the non-linear nature of spatial data.
  	
  
	
  
Thus the k-means algorithm (k = 20) was used to translate the location data, represented by longitude/latitude pairs, into clustered locations. Figure 6 shows the location of the clusters using a normalized scale.
  
	
  
	
  
As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations will be used as a categorical variable in our models.
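The clustering step can be sketched with a small pure-Python version of Lloyd's k-means algorithm. This is an illustration under assumptions, not the study's implementation: the trap coordinates are hypothetical, k is set to 2 to keep the example readable (the study used k = 20), and the initial centers are drawn at random with a fixed seed.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two (latitude, longitude) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: returns one cluster label per point and the k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical trap locations: three in the northwest, three in the southeast.
traps = [(41.95, -87.90), (41.96, -87.89), (41.97, -87.91),
         (41.70, -87.60), (41.71, -87.61), (41.69, -87.59)]

labels, centers = kmeans(traps, k=2)
```

The resulting label of each trap is what would enter a model as the categorical location variable, in place of the raw coordinates.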
  	
  
	
  
Part II
  
	
  
With the necessary data pre-processing and variable transformations completed, the focus moved to the construction of models to predict WNV. The overall approach was to build an ensemble, a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers from which it is built. The strategy was to consider four individual algorithms and build the best possible classifier out of each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle’s train dataset was split with 70% and 30% probabilities, where the 70% was used as the training set and the remaining 30% served as the hold-out test dataset.
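One way to read "split by 70% and 30% probabilities" is that each record is assigned to the training set with probability 0.7. The sketch below follows that reading; the seed, the record count and the function name are hypothetical.

```python
import random

def split_by_probability(n_rows, train_prob=0.7, seed=424):
    """Assign each row index to train with probability train_prob, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < train_prob else test).append(i)
    return train, test

# Hypothetical number of records in the training file.
train_idx, test_idx = split_by_probability(1000)
```

Because the assignment is random per record, the realized split is close to, but not exactly, 70/30.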
  	
  
	
  
Figure 7 is a summary of the best set-ups for each algorithm. Of all the individual models, GAM was clearly the best performing, with an AUC value of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
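The weighted-average ensemble and its AUC can be sketched as follows. Only the weights 0.6 and 0.4 come from the study; the labels and the predicted probabilities are invented for illustration, and the AUC is computed directly from its definition as the probability that a positive case is scored above a negative one.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_ensemble(pred_a, pred_b, w_a=0.6, w_b=0.4):
    """Blend two probability vectors with fixed weights."""
    return [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]

# Hypothetical hold-out labels (1 = WNV present) and model outputs.
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
gam_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.30, 0.05, 0.15]
svm_prob = [0.20, 0.10, 0.60, 0.50, 0.40, 0.70, 0.10, 0.30]

blend_prob = weighted_ensemble(gam_prob, svm_prob)
auc_gam = auc(y_true, gam_prob)
auc_svm = auc(y_true, svm_prob)
auc_blend = auc(y_true, blend_prob)
```

In this toy data each model misranks a different negative case, so the blend ranks better than either model alone, which is the effect an ensemble is meant to exploit.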
  	
  
	
  
	
  
 
	
  
  
CONCLUSION	
  
	
  
Although the ensemble model had the highest AUC value achieved in the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard.
  	
  	
  
	
  
In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were an ensemble of GAM logistic regression and GLM logistic regression, and a slightly modified Poisson GLM model. Neither had a notable training AUC, but both performed well on Kaggle.
  	
  	
  
	
  
Other validation techniques were investigated in an attempt to obtain better feedback from the training process, which in turn would help build a better model. Instead of using a 70/30 training and testing split, a modified version of n-fold cross validation was used where one year’s data was left out as testing and the remaining years were used as training. This process was repeated four times, once for each year, and the model’s performance was averaged over the four runs. The best models achieved from this validation technique did not seem any different from the models built on a traditional 70/30 split.
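The leave-one-year-out scheme described above can be sketched as a generator of train and test row indices. The record years below are hypothetical and the function name is illustrative.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test

# Hypothetical year of each record in the training file.
record_years = [2007, 2007, 2009, 2011, 2011, 2013]

folds = list(leave_one_year_out(record_years))
```

With four distinct years this yields four folds; a model would be fitted on each training index set, scored on the held-out year, and the four scores averaged.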
  
	
Figure 8: Models & Imbalance
  
Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see if the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and their relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance.
  
	
  
If using the appropriate validation technique does not account for the disparity between the training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and the testing data.
  	
  
	
  
Specifically, it is possible that there are idiosyncratic inter-annual variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al (2015) report that the populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year.
  	
  	
  
	
  
It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data.
  	
  
	
  
This intuitively makes sense, as most of the models that performed better on Kaggle tend to be simple models that included variables like location, week number and mosquito species, which generalize across all years of the data.
  	
  
	
  
 
	
  
  
Another matter of consideration for future model building is the importance of the spray data. Though the spray data is not a part of the testing dataset, which would warrant an immediate dismissal from the predictor selection process, the following heat map implies otherwise. Upon close inspection of the heat map, one speculates that spraying one year does indeed alter the mosquito population the next year, which might explain why mosquito populations appear in different locations each year.
  	
  
	
  
Also, feature engineering of the predictor variable depart [departure from normal] might help in creating a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from the normal temperature as hotter than normal or colder than normal.
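The proposed categorization could look like the sketch below. It is only a suggestion in code form: the tolerance parameter and the third "normal" category are assumptions added for illustration, and the function name is hypothetical.

```python
def categorize_departure(depart, tolerance=0.0):
    """Map a departure from the normal temperature (in degrees) to a category."""
    if depart > tolerance:
        return "hotter than normal"
    if depart < -tolerance:
        return "colder than normal"
    return "normal"

# Hypothetical daily values of the depart field.
depart_values = [6.0, -3.0, 0.0, 11.0]
depart_category = [categorize_departure(d) for d in depart_values]
```

A non-zero tolerance would treat small departures as normal, which may reduce noise in the engineered feature.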
  	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
	
  
Appendix	
  
	
  
Table 3: Data Fields

FIELDS
Number  Train             Weather                Spray      Test
1       Date              Station                Date       ID
2       Address           Date                   Time       Date
3       Species           Max Temperature        Latitude   Address
4       Block             Min Temperature        Longitude  Species
5       Street            Avg Temperature                   Block
6       Trap              Departure from Normal             Street
7       Address Number    Dew Point                         Trap
8       Latitude          Wet Bulb                          Address Number
9       Longitude         Heat                              Latitude
10      Address Accuracy  Cool                              Longitude
11      # of Mosquitoes   Sunrise                           Address Accuracy
12      Wnvpresent        Sunset
13                        Code Sum
14                        Depth
15                        Water1
16                        Snowfall
17                        Total Precipitation
18                        Station Pressure
19                        Sea Level
20                        Wind Speed
21                        Wind Direction
22                        Average Speed
 
	
  
  
TABLE 2 | SPECIES 2
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
SKEWNESS OF VARIABLES & OUTLIERS
DATE PATTERN
	
  
	
  
	
  
The data is skewed to the left. There are more records for 2007 than for other years, but not by a significant amount. If this becomes problematic, we may sample an equal number of records for each year.
	
  
	
  
	
  
	
  
 
	
  
  
LATITUDE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
Shape: Latitude is very slightly skewed to the left; the mean is less than the median.

Center: 41.84628
  	
  
	
  
Spread: 41.64461 to 42.01743
	
  
	
  
	
  
	
  
	
  
 
	
  
  
LONGITUDE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: Longitude is symmetric.

Center: -87.69499
  	
  
	
  
Spread: -87.93099 to -87.53163
	
  
	
  
	
  
	
  
	
  
 
	
  
  
NUMBER OF MOSQUITOS PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is right skewed, as the mean (12.85351) is pulled to the right, away from the median, which is 5.

Center: 5
  
	
  
Spread: 1 to 50
Outlier: The boxplot confirms the skewness of the histogram in that there are large numbers causing the distribution to be pulled to the right. The outlier function indicates the largest number in the data for number of mosquitos is 50.
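The Shape and Outlier entries in this appendix follow two simple rules that can be sketched as below. Both are assumptions about how the summaries were produced: skew direction is read from the mean relative to the median, and the "outlier function" is taken to return the observation farthest from the sample mean. The counts are hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule of thumb: the mean is pulled toward the longer tail."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"

def most_distant_from_mean(values):
    """Return the observation with the largest absolute distance from the mean."""
    m = mean(values)
    return max(values, key=lambda v: abs(v - m))

# Hypothetical mosquito counts per trap record.
counts = [1, 2, 3, 5, 5, 7, 9, 21, 50]

direction = skew_direction(counts)
extreme = most_distant_from_mean(counts)
```

The mean-versus-median comparison is a heuristic and can mislead for multimodal data, so it is best read alongside the histogram and box plot.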
	
   	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
DISTANCE FROM O’HARE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is symmetric.

Center: 0.2943334
  
	
  
Spread: 0.0372549 to 0.5179756
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
DISTANCE FROM MIDWAY PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is slightly skewed to the left, as the mean 0.1548598 is pulled away from the median 0.1616137.

Center: 0.1616137
  
	
  
Spread: 0.0077139 to 0.2481943
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MAXIMUM TEMPERATURE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 81.94765 is pulled away to the left from the median 83.

Center: 83
  
	
  
Spread: 57 to 97
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 57 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MINIMUM TEMPERATURE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 64.16533 is pulled away to the left from the median 66.

Center: 66
  
	
  
Spread: 41 to 79
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 41 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
AVERAGE TEMPERATURE PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 38.28412 is pulled away to the left from the median 40.

Center: 40
  
	
  
Spread: 15 to 52
Outlier: The box plot shows
the presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 15 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
TOTAL PRECIPITATION PATTERN
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 0.1274281 is pulled away to the right from the median 0.

Center: 0
  
	
  
Spread: 0.00 to 3.97
Outlier: The box plot shows
the presence of some points
influencing the movement of the
distribution to the right. The outlier
function indicates that 3.97 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
	
  
RESULT OF WIND SPEED PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 5.911003 is pulled away to the right from the median 5.5.

Center: 5.5
  
	
  
Spread: 0.1 to 15.4
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the right. The outlier
function indicates that 15.4 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
	
  
RESULT OF WIND DIRECTION PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 17.72016 is pulled away to the left from the median 19.

Center: 19
  
	
  
Spread: 1 to 36
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
AVERAGE WIND SPEED PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 123.4147 is pulled away to the left from the median 139.

Center: 139
  
	
  
Spread: 3 to 177
Outlier: The box plot shows
the presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 3 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
TEMPERATURE MOVING AVERAGES - 1 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 72.5431 is pulled away to the left from the median 73.14286.

Center: 73.14286
  
	
  
Spread: 53.14286 to 83.85714
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 53.14286 is
the point that is distant from the
other values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
TEMPERATURE MOVING AVERAGES – 2 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the left, as the mean 72.41439 is pulled away to the left from the median 73.

Center: 73
  
	
  
Spread: 55.07143 to 82.76923
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the left. The outlier
function indicates that 55.07143 is
the point that is distant from the
other values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MOVING AVGS OF PRECIPITATION – 1 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 0.1333564 is pulled away to the right from the median 0.07.

Center: 0.07
  
	
  
Spread: -0.0000 to 1.42857
Outlier: The box plot shows the
presence of some points influencing
the movement of the distribution to the
right. The outlier function indicates
that 1.42857 is the point that is distant
from the other values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MOVING AVGS OF PRECIPITATION – 2 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 0.130 is pulled away to the right from the median 0.085.

Center: 0.085
  
	
  
Spread: 0.0007 to 0.76714
Outlier: The box plot shows the
presence of some points influencing
the movement of the distribution to
the right. The outlier function
indicates that 0.76714 is the point
that is distant from the other values in
the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MOVING SUM OF PRECIPITATION – 1 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 0.9432334 is pulled away to the right from the median 0.53.

Center: 0.53
  
	
  
Spread: -0.000 to 9.149
Outlier: The box plot shows the
presence of some points
influencing the movement of the
distribution to the right. The outlier
function indicates that 9.15 is the
point that is distant from the other
values in the dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
MOVING SUM OF PRECIPITATION – 2 WEEK PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 1.74216 is pulled away to the right from the median 1.1.

Center: 1.1
  
	
  
Spread: -0.000 to 10.74999
Outlier: The box plot shows the
presence of some points influencing
the movement of the distribution to
the right. The outlier function
indicates that 10.75 is the point that
is distant from the other values in the
dataset.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
DEGREE DAY PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 3.824472 is pulled away to the right from the median 3.4.

Center: 3.4
  
	
  
Spread: 0.0 to 14.9
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN
	
  
	
  
	
  
	
  
	
  
	
  
	
  
Shape: The distribution is skewed to the right, as the mean 241.0934 is pulled away to the right from the median 239.6.

Center: 239.6
  
	
  
Spread: 1.3 to 521.1
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
LOWER & UPPER BOUND OUTLIERS
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
GROUPED LINE GRAPH | YEAR 2007
	
  
	
  
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
GROUPED LINE GRAPH | YEAR 2009
	
  
	
  
	
  
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
	
  
	
  
	
  
	
  
	
  
	
  
	
  
 
	
  
  
	
  
	
  
	
  
	
  
	
  
	
  
GROUPED LINE GRAPH | YEAR 2011
	
  
	
  
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
	
  
 
	
  
  
	
  
	
  
	
  
	
  
	
  
GROUPED LINE GRAPH | YEAR 2013
	
  
	
  
Blue line: The average precipitation. Red line: The average number of mosquitos
Green line: The average temperature. Purple line: The presence of virus.
	
  
	
  
 
	
  
  
Works Cited
Abdi, Herve. Multivariate analysis. Retrieved from www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf
Andridge & Little. (2011). A review of hot deck imputation for survey non-response. Int Stat Rev. 78(1): 40-64. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/
Ezanno, P, Aubry-Kientz, M et al. (2015). A generic weather driven model to predict mosquito population dynamics applied to species of Anopheles, Culex and Aedes genera of southern France. 120(1): 39-50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972
Kaggle. West Nile Prediction. Retrieved from: https://www.kaggle.com/c/predict-west-nile-virus/data
Kuhn & Johnson (2013). Applied Predictive Modeling. New York, Springer.
Natural Resources Management and Environmental Departments. Annex 4: Statistical Analysis of Weather Data Sets 1. Retrieved from: http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage
Ruiz, Marilyn O., F Chavez Luis et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors. Retrieved from http://www.parasitesandvectors.com/content/3/1/19.
Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus illness and urban landscapes in Chicago and Detroit. International Journal of Health Geographics.
Theophilides, C.N., S.C. Ahearn et al. (2006). First evidence of West Nile virus amplification and relationship to human infections. International Journal of Geographical Information Science, 20, 103-115.
Sim, C.H, Gan, F. F. et al (2005). Outlier labeling with boxplot procedures. Journal of the American Statistical Association, 100(470). Retrieved from: http://www.jstor.org/stable/27590584

MULTIVARIATE ANALYSIS PREDICTS WEST NILE VIRUS USING WEATHER SPATIAL FACTORS

  • 1.   1   MULTIVARIATE  ANALYSIS   FARZAD  ESKANDANIAN,  MAX  LI,  JOYCE  ROSE,  NASIM  SONBOLI     CSC  424  |  ADVANCED  DATA  ANALYSIS   6|14|2015         The  purpose  of  this  paper  is  to  discuss  the  model(s)  used  in  predicting  the  presence  or  absence  of  the   West  Nile  virus  [WNV].    The  uniqueness  of  this  multivariate  analysis  is  the  use  of  weather,  temporal   and  spatial  factors  based  on  the  premise  of  time  based  effects.  That  is,  the  models  built  take  into   account  the  developmental  stages  of  a  mosquito.  Four  individual  classifiers    -­‐  1)  logistic  regression   using  a  generalized  additive  model  (GAM),  2)  linear  discriminant  analysis  (LDA),  3)  random  forests,   and  4)  support  vector  machines  (SVM)  –  were  built  and  the  best  combinations  of  parameters  from   each   model   was   included   in   the   ensemble   model.   Species,   week   number,   location,   moving   temperature  averages,  precipitation  moving  averages  and  growing  degree  days  played  an  important   role  in  predicting  WNV.  The  best  overall  ensemble  classifier  was  a  weighted  average  of  GAM  and  SVM   with  weights  of  0.6  and  0.4,  respectively,  and  an  AUC  of  0.8361962       INTRODUCTION       The   west   Nile   Virus   (WNV)   is   “a   mosquito   borne   disease-­‐causing   infectious   agent”   (Theophilides  et  al,  2006,  para.  1)  that  affects   birds,   humans,   and   animals.   In   1999,   WNV   was  first  reported  in  the  United  States.  Since   the   initial   occurrence   the   presence   of   WNV   causing   seasonal   epidemics   have   been   recorded   leading   to   a   series   of   research   focused   on   understanding   the   features   and   characteristics   of   the   virus.   
The   research   available   on   WNV   indicates   that   “the   infections   caused   by   pathogens   by   way   of   a   mosquito   vector   often   cluster   in   space   and   time   given   the   habitat   requirements   of   the   vectors   and   the   vertebrate   involved   in   the   transmission.”  (Ruiz  et  al,  2007,  para  8).     In   other   words,   the   West   Nile   viral   transmission   is   attributed   to   the   patterns   of   climate,   landscape,   hydrology   and   types   of   human   settlements.   Ruiz   et   al   (2010)   argue   that   the   statistical   models   built   thus   far   by   researchers   are   mere   reports   that   only   characterize   associations   between   the   virus   and   weather,   landscape,   human   density   etc.   Though  they  offer  insights  about  the  WNV,  the   associations   themselves   are   not   enough   to   develop   and   implement   preventive   measures   for  future  epidemics.  The  interesting  aspect  of   the   WNV   challenge   arises   from   the   need   to   build   a   better   model   that   takes   into   account   the  life  cycle  of  the  mosquitoes  in  relationship   to  the  variability  in  weather  and  its  impact  “on   WEST  NILE  VIRUS  |  CHICAGO  
  • 2.     2   growth   or   activity   of   an   organism.”   Such   a   model  can  take  a  step  beyond  associations  and   indicate  what  the  best  time  and  location  is  for   early  intervention.  The  importance  of  building   a   robust   model   with   predictive   capabilities   lies  in  the  need  to  prevent  an  outbreak  in  the   future.  Therefore  the  goal  of  this  project  is  to   build  a  model  that  uses  weather,  temporal  and   spatial  factors  to  predict  the  West  Nile  virus.       DATA  DESCRIPTION   Kaggle’s  West  Nile  Virus  challenge  consists  of   the  following  datasets1:   Obs   Train   Weather   Spray   Test   10506   2944   14835   116293   Var   12   22   4   11       The  datasets  contains  a  combination  of  string   and  numeric  variables.       “In   many   cases,   some   predictors   have   no   values  for  a  given  sample.  These  missing  data   could   be   structurally   missing”   (Kuhn   &   Johnson,   p.41).   For   instance,   station   2   does   not   collect   information   on   depart,   depth,   water1,   snowfall,   sunset   and   sunrise.   These   structurally   missing   values   are   denoted   by   “M,”   “T”,   or   “-­‐“.   “In   other   cases,   the   value   cannot  or  was  not  determined  at  the  time  of   the   model   building”   (Kuhn   &   Johnson,   p.41).   Examples   of   such   missing   values   are   tavg,   wetbulb,   heat,   cool,   preciptotal,   stnpressure,   sea   level,   time   [584   values]   and   average   speed.  Hence,  the  spray  data  and  the  weather   data  do  contain  missing  values.       The   missing   value   for   the   time   data   set   is   “concentrated  in  a  subset  of  predictors”  (Kuhn   &   Johnson,   p.41).   
In   other   words,   the   584   missing   values   pertaining   to   the   spray   data   relates   to   09/07/2011   where   time   has   not                                                                                                                   1 The fields for the datasets can be found in Table 1 in the appendix titled “Data Fields”. been   recorded   after   7:44:32   PM   and   before   7:46:30  PM.  The  non-­‐structurally  missing  data   values   for   the   weather   dataset,   however,   appear   to   occur   randomly   across   all   the   predictors.     The   counts   of   missing   values   for   each   of   the   predictor   variables   have   been   tabulated  below.           The   response   variables   are   the   two   classes   that   the   model   aims   to   predict   namely   the   presence  or  absence  of  the  West  Nile  Virus  [1,   0].       The   explanatory   variables   are:   maximum   temperature,   minimum   temperature,   average   temperature,  precipitation,  result  wind  speed,   result  wind  direction,  species,  trap,  longitude,   latitude,  number  of  mosquitoes  and  address.     EXTERNAL  DATASETS   Although  Kaggle  already  provides  a  number  of   explanatory  variables  for  the  West  Nile  Virus   challenge,   there   are   ample   opportunities   to   include   external   datasets   that   may   contain   other  variables  that  can  improve  a  predictive   model’s  performance.  For  example,  Ruiz  et  al   (2010)   found   that   the   amount   of   vegetation   and  the  degree  to  which  water  would  flow  or   remain   in   an   area   mediated   the   effect   of   weather   in   predicting   the   infection   rate   of   West   Nile   Virus.   Socioeconomic   factors   that   measured   poverty   also   seemed   to   correlate   with  the  presence  of  West  Nile  Virus.  Bringing   in   additional   data   from   reliable   government   sources   that   reflect   the   aforementioned  
  • 3.     3   factors  will  help  us  finely  tune  our  predictive   models.     MULTIVARIATE  ANALYSIS     The  main  objective  of  a  multivariate  analysis   is   to   use   multiple   data   mining   techniques   to   study   how   variables   relate   to   one   another.   This   method   of   analysis   is   most   often   used   when   the   dataset   contains   more   than   one   explanatory   or   response   variable   or   even   both.   Kaggle’s   West   Nile   Virus   dataset   contains   one   response   variable   and   12   explanatory  variables.         Using   a   multivariate   analysis   for   such   a   dataset  is  desirable  because  the  final  outcome   of   accurately   predicting   the   presence   or   absence  of  WNV  might  be  influenced  by  more   than   one   attribute.   For   instance,   principal   component   analysis   can   be   used   to   “decompose   a   data   table   with   correlated   measurements  into  a  new  set  of  uncorrelated   (i.e.,   orthogonal)   variables”   (Abdi,   p.1).   Performing  PCA  will  determine  the  dominant   trends  in  the  dataset  upon  which,  for  example,   a  logistic  regression  model  can  be  applied.       Conducting  a  logistic  regression  alone  with  12   explanatory   variables   may   not   produce   a   stable   model   if   there   is   a   strong   dependence   between   predictors.   PCA   addresses   the   issue   of   multicollinearity   resulting   in   a   regression   model  that  accurately  estimates  the  response   variable.   Therefore,   the   advantages   and   disadvantages   of   using   one   technique   in   conjunction   with   another   in   light   of   the   number   of   explanatory   variables   offers   a   purpose  to  use  multivariate  analysis.       
DATA  COLLECTION     The   dataset   provided   by   the   Chicago   Department   of   Public   health   and   NOAA   [National   Oceanic   and   Atmospheric   Administration]   comprises   of   weather   data2,   GIS   data3,   date   of   traps   set   [spanning   3   days   each   week   for   approximately   5   months],   location   of   traps   and   species   for   the   years   between  2007  and  2014.  The  main  dataset  is   broken   into   two   sets   of   data   that   is   the   training  and  the  testing  dataset.  The  training   dataset   reflects   data   points   collected   for   the   odd   years:   2007,   2009,   2011   and   2013.   Whereas,   the   testing   dataset   consists   of   data   points   gathered   for   the   even   years:   2008,   2010,  2012  and  2014.       There  are  two  central  factors  that  serve  as  the   premise  for  when  and  why  the  WNV  data  was   collected.   The   first   factor   is   weather.   “It   is   believed  that  hot  and  dry  conditions  are  more   favorable   for   West   Nile   virus   than   cold   and   wet.”   (Kaggle,   information   description,   para.   9)  Therefore,  the  dataset  captures  information   about   weather   [from   station   1   –   Chicago   O’Hare  International  Airport  –  and  station  2  –   Chicago   Midway   International   Airport]   only   for   the   months   of   late   May   through   early   October.   The   second   factor   is   the   availability   of  data  for  the  number  of  mosquitos’  trapped,   location,  species  identified  and  the  test  results   of   the   presence   or   absence   of   the   West   Nile   virus.   “Every   year   from   late-­‐May  to   early-­‐ October,   public   health   workers   in   Chicago   setup  mosquito  traps  scattered  across  the  city.   
Every   week   from   Monday   through   Wednesday,  these  traps  collect  mosquitos,  and   the   mosquitos   are  tested   for   the   presence   of   West   Nile   virus   before   the   end   of   the   week.”   (Kaggle,  information  description,  para.  3)     It  is  no  coincidence  that  traps  are  only  set  out   in   late   spring   through   early   fall   when   the   weather   is   conducive   to   the   population   growth  in  mosquitos.  Identifying  the  location                                                                                                                   2  Weather data has been collected only for dates on which the traps were set 3 GIS data for spraying is only available from 2011 to 2013,  
  • 4.     4   of   the   traps,   the   number   of   mosquitos’   trapped,   the   species,   and   the   frequencies   of   each  species  infected  or  not  infected  with  the   virus  in  conjunction  with  weather  is  crucial  in   understanding   where   the   next   sporadic   growth  of  the  mosquitos  will  occur.  After  all,   the  goal  of  the  predictive  model  is  to  identify   the   presence   or   absence   of   the   WNV   by   predicting   the   occurrence   and   the   rate   of   mosquito   growth   in   one   particular   location   over   another   given   a   set   of   weather   conditions.   Such   predictions   can   be   used   by   the   City   of   Chicago   and   CPHD   “to   efficiently   and   effectively   allocate   resources”  to   control   the  population  growth  of  mosquitos  which  in   turn   prevents   the   transmission   of   the   “potentially  deadly  virus.”     DATA  MERGING     The   West   Nile   training   dataset   does   not   contain   the   weather   variables   required   for   a   robust   analysis.   Therefore,   the   weather   dataset   has   been   merged   with   the   train   file   resulting   in   a   merged   file   titled   “wnv.train.weather.”   The   unique   identifier   used  to  merge  both  files  are  date  and  station.       Since   the   NOAA   Weather   dataset   provides   weather   data   from   two   weather   stations   located   in   the   Greater   Chicago   Area,   the   distance   was   calculated   from   the   site   of   individual   traps   to   each   of   the   two   weather   stations   and   was   used   to   select   the   appropriate   weather   information   for   each   training  record  based  on  the  proximity  of  the   two   weather   stations.   Two   distance   metrics   were   considered:   1)   Euclidean   distance   formula,       𝐷 = (𝑙𝑎𝑡!"#"$%& − 𝑙𝑎𝑡!"#$)! + (𝑙𝑜𝑛𝑔!"#"$%& − 𝑙𝑜𝑛𝑔!"#$)!     
as   well   as   2)   Haversine   formula   (http://en.wikipedia.org/wiki/Haversine_for mula)  when  taking  into  account  the  curvature   of  the  Earth,         The   “geosphere”   R   package   was   used   to   calculate  the  Haversine  formula  for  distance.       NEW  FEATURES     Ruiz  et  al.  (2010)  reported  the  importance  of   temporal   characteristics   of   weather   in   predicting  infection  rates  of  WNV  in  Northern   Illinois.   For   example,   they   found   a   positive   correlation   at   1   to   3   week   lags   between   precipitation  and  infection  rates.  Based  on  this   research   new   features   were   created   to   capture   this   information   in   the   weather   dataset,   namely   a   2   week   moving   average   of   precipitation  as  well  as  a  2  week  moving  sum   of  accumulated  rainfall.       Also,   time-­‐based   effects   of   temperature   was   explored  and  this  entailed  the  use  of  a  metric   known   as   growing   degree   days   (GDD)   to   measure   heat   accumulation   used   to   predict   mosquito   development   rates.   GDD   was   calculated  as     𝐺𝐷𝐷 =   𝑇!"#$ − 𝑇!"#$,  𝑖𝑓  𝑇!"#$ >   𝑇!"#$ 0,                                                                  𝑖𝑓    𝑇!"#$ ≤   𝑇!"#$     where   Tbase   represents   a   threshold   temperature  where  an  organism’s  growth  rate   is   near   zero.   From   reviewing   literature,   Tbase   can   range   between   13°C   and   33°C.   We   will   vary  Tbase  and  observe  the  threshold  value  that   yields  the  best  performing  model.       Other   features   that   were   created   from   the   base   training   data   include   the   specific   week   number   of   a   year.   It   is   expected   that   the   abundance   of   mosquitos   and   consequently,   the   presence   of   WNV,   to   be   more   prevalent   during  certain  times  of  the  year.  Therefore  it  
  • 5.     5   is   surmised   that   the   week   number   will   be   important  in  predicting  the  timing  of  WNV.       CATEGORICAL  VARIABLES     Dealing   with   categorical   variables   can   pose   certain  limitations.  For  example,  if  a  variable   in  a  given  data  set  contains  several  categories   there  arises  a  need  to  re-­‐categorize  the  classes   into  smaller  groups  for  the  sake  of  simplicity   and  the  robustness  of  the  predictive  model.  In   addition,   depending   on   the   data   mining   technique  used  the  need  to  use  numerical  data   than  categorical  data  becomes  eminent.           The   categorical   variables   found   in   the   WNV   dataset   have   undergone   transformations   in   the   form   of   re-­‐categorization.   For   instance,   variable   species   is   categorical   with   seven   classes  as  indicated  in  the  table  below:     Table  1  Species   However,   table   1   species   indicates   that   3   species   specifically   have   been   tested   positive   for   WNV.   Re-­‐categorization   highlights   the   importance   of   the   three   classes   associated   with  WNV  leaving  the  other  four  classes  to  be   grouped  in  a  category  of  its  own  indicative  of   the  lack  of  attribution  to  the  spread  of  WNV4.   It   is   also   important   to   note   that   the   training   set   has   a   class   titled   “uncategorized.”   By   creating   the   fourth   category   called   “Culex   Other”  the  issue  of  the  unidentified  species  is   addressed  effectively.                                                                                                                       4  Table 2 titled Species 2 contains the new groupings     The   re-­‐categorization   approach   has   been   applied  to  the  variable  date  as  well.       
EXPLORARTORY  DATA  ANALYSIS     One  of  the  prime  focus  of  an  exploratory  data   analysis   is   to   check   whether   the   specific   characteristic(s)   of   a   data   set   meets   the   requirements  of  the  modeling  technique(s)  to   be   used   as   some   models   maybe   sensitive   to   certain  types  of  data.    That  is,  how  is  the  data   set  distributed?     Skewedness   of   a   distribution   whether   it   is   positive   or   negative   is   often   a   result   of   a   “subset   of   observations   that   appear   to   be   inconsistent  with  the  remaining  observations   that  follow  a  hypothesized  distribution.”  (Sim   et  al,  2005,  pg.642).  Histograms  and  box  plots   are  graphical  tools  widely  used  to  inspect  the   data   for   the   presence   of   outliers.   There   are   two   important   questions   to   address   after   visually   inspecting   the   boxplot:   first,   is   it   possible  for  the  boxplot  to  incorrectly  declare   certain   points   as   outliers.   Second,   does   the   presence   of   outliers   imply   the   need   for   a   transformation?         The  graphical  representation  of  the  box  plots5   for  the  West  Nile  dataset  has  identified  certain   variables   to   be   skewed   with   the   presence   of   outliers.   For   instance,   the   distribution   of   the   number   of   mosquitos   is   right   skewed.   The                                                                                                                   5  All   histograms   and   box   plots   with   short   description   of   shape,   center   and   spread   for   the   WNV  data  set  can  be  found  in  the  appendix.    
  • 6.     6   distribution   being   pulled   to   the   right   by   the   largest   number   in   the   data   set   for   the   respective   column.   The   IQR6  rule   for   outliers   indicates   that   values   lying   below   -­‐20   and   above   39.5   are   potential   outliers.   On   examining   the   number   of   mosquitos   trapped   for   each   species   it   is   apparent   that   class   imbalance   plays   an   important   role   in   the   skewedness  of  the  data  as  shown  in  Table  2.     Table  2:  Number  of  Mosquitos  Trapped   All  numbers  above  39.5  represent  the  species   attributed  to  the  WNV  and  the  location  where   it  abounds.  There  exists  a  pattern  between  the   type  of  species,  the  location  and  the  number  of   mosquitos  trapped  that  is  beyond  the  scope  of   the  boxplot.     Similarly  the  boxplot  for  most  of  the  weather   variables   in   the   WNV   dataset   shows   the   presence   of   outliers.   However,   yearly,   monthly,   weekly   and   daily   variations   in   weather   are   infinite   and   the   differences   in   data  points  for  station  1  and  2  can  be  due  to   the   geographical   locations   of   the   stations   and/or   the   way   in   which   the   instruments   record  the  temperatures.       The   Natural   Resources   Management   and   Environment   Department   furthers   this   argument   by   stating   that   “weather   data   collected  at  a  given  weather  station  during  a   period   of   several   years   may   be   not   homogeneous,  i.e.,  the  data  set  representing  a   particular   weather   variable   may   present   a                                                                                                                   6  The  appendix  contains  a  table  titled  “Lower  and   Upper  Bound  Outliers”     sudden   change   [from   one   weather   station   to   another].  
This  phenomenon  may  occur  due  to   several   causes,   some   of   which   are   related   to   changes   in   instrumentation   and   observation   practices,   and   others,   which   relate   to   modification  of  the  environmental  conditions   of  the  site”  or  even  “change  in  the  time  of  the   observations.”  (para.14)     Thus,  the  skewedness  of  the  distribution  is  not   necessarily   a   consequence   of   extreme   data   points.   However,   it   is   a   result   of   class   imbalance.  For  instance,  the  histogram  for  the   accumulated   degree   day   shows   that   distribution  is  skewed  to  the  right.  But  when   the   histogram   is   constructed   taking   into   consideration  the  presence  or  absence  of  WNV   it   becomes   clear   that   imbalanced   class   is   the   root   of   the   skewedness   as   seen   in   the   histograms  below:           The   histograms   show   that   there   are   no   wnvpresent   at   lower/higher   degree   days.   However,  the  histograms  for  acc.deg.day  when   wnvpresent  =  0  or  1  and  0  appears  to  be  more   flat.  In  order  to  remove  distribution  skewness   the   data   points   was   replaced   by   the   square   root.   Thus   resulting   in   a   data   that   is   better   behaved  than  in  its  original  units.      
  • 7.     7   In   addition   to   skewness,   another   factor   that   affects  the  predictive  capability  of  a  model  is   the  presence  of  outliers.  As  noted  earlier,  the   weather  data  consists  of  outliers.  “For  a  large   dataset,  removal  of  samples  based  on  missing   values   is   not   a   problem,   assuming   the   missingness   is   not   informative”   (Kuhn   &   Johnson,  2013,  p.41).  However,  a  more  robust   way   of   handling   missing   information   is   by   imputation.    “Imputation  is  layer  of  modelling   where  missing  values  are  estimated  based  on   other   predictor   variables.   This   amounts   to   a   predictive   model   within   a   predictive   model”   (Kuhn  &  Johnson,  2013,  p.42).       Missing   values   in   the   weather   data   set   have   been  addressed  by  the  implementation  of  hot   deck  imputation  where  each  missing  value  is   replaced   with   an   observed   value   from   a   similar  unit.  “An  attractive  feature  of  the  hot   deck  imputation  is  that  only  plausible  values   can   be   imputed   since   values   come   from   observed   responses   in   the   donor   pool”   (Andridge   &   Little,   2011,   para.   3)   which   means  that  the  weather  data  is  more  likely  to   be   similar   to   the   other   data   points   than   imputing   averages.   The   second   advantage   of   using  hot  deck  imputation  is  that  the  “method   does  not  rely  on  model  fitting  for  the  variable   to   be   imputed   and   thus   is   potentially   less   sensitive   to   model   misspecification   than   an   imputation   method   based   on   a   parametric   method   such   as   regression   imputation”   (Andridge  &  Little,  2011,  para.  3).     CORRELATION  ANALYSIS     There  are  specific  variables  in  the  dataset  that   reveal   interesting   patterns   such   as   the   number   of   mosquitos,   temperature   and   precipitation.    
The goal of the correlation analysis was to plot or capture a trend that would explain the relationship between the variables and the presence of the West Nile Virus. Since the variables are on different scales, they were normalized using the Z score formula. In addition to normalizing the data, average values of these variables were used in building the plots.

The plots pertain to weekly records captured over 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. An individual plot was drawn for each year.

The blue line shows the average precipitation, the red line the average number of mosquitos, the green line the average temperature and the purple line the presence of the virus.

Figure 1: 2007

According to the line graph for the year 2007, a sudden decrease in temperature causes the number of mosquitos to decrease after week 35. Consequently, the average number of virus detections decreases.

It was also noted that the higher the temperature and the precipitation, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus.

An interesting pattern was found between precipitation and the increase in the number of mosquitos. The number of mosquitos increases rapidly not during the week of high precipitation but in the week after. It appears that once the number of mosquitos increases, the virus then infects the mosquitos.

Figure 2: 2009

The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, suggesting that all of the mosquitos carry the virus in their blood even though the mosquito population is small.

Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots capture similar trends.

Figure 3: 2011

Figure 4: 2013

The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model is expected to rely on these features more than the others to predict WNV.

Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity. This is an issue to consider in the modeling process.

MODELS

Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm.
It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV.

Therefore, the modeling process for this data set is broken into two parts. Part I focuses on determining how best to incorporate the available features into a classification model. Part II focuses on investigating and fine tuning the specific classification algorithms to yield the best possible prediction.

Part I

Weather Data and Principal Component Analysis

Due to the number of weather attributes available in the dataset, it is difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features are correlated with one another, resulting in multicollinearity. For example, the amount of precipitation is correlated with atmospheric pressure, which in turn is correlated with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences in the original weather data while eliminating the detrimental effects that can result from linear dependency among predictor variables.

Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first 5 components from PCA are used to reflect the weather conditions of a specific day in the data.

Figure 5: PCA

Figure 6: Clustering
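The PCA step on the weather attributes can be illustrated with a small Python sketch. It is not the authors' code: it assumes the attributes are standardized first, extracts only the first component by power iteration on the correlation matrix, and uses made-up readings for three correlated attributes (tmax, tmin, dewpoint). All function names are hypothetical.

```python
import math

def standardize(rows):
    """Z-score each column so variables on different scales are comparable."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        columns.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*columns)]

def covariance(rows):
    """Covariance matrix of already-centred rows (here: the correlation matrix)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iters=200):
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    p = len(cov)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return eigenvalue, v

# Made-up daily readings: tmax, tmin, dewpoint (illustrative only).
weather = [[83, 66, 60], [75, 58, 52], [90, 72, 66], [68, 50, 47], [79, 63, 55]]
z = standardize(weather)
eigenvalue, loadings = first_component(covariance(z))
# The trace of a correlation matrix equals the number of variables.
explained = eigenvalue / len(loadings)
# Component 1 score for each day: projection of the standardized row on the loadings.
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]
```

With strongly correlated inputs such as these, one component absorbs most of the variance, which is why a handful of components can stand in for many weather attributes. Further components would require deflation or a full eigendecomposition, which a numerical library would normally provide.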
Figure 7: Model Summary

Temporally based weather variables and week number

While the weather conditions of a specific day can affect the activity level of mosquitos for that day, they do not take into account a mosquito’s life-cycle or the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree day, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) are included in the model.

Week numbers of the year are also incorporated to capture the seasonal (intra-annual) timing of mosquito populations.

Clustering Location Data

A good representation of location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, these are believed not to be in a form conducive to predictive modeling due to the non-linear nature of spatial data.

Thus the k-means algorithm (k = 20) was used to translate the location data, represented by longitude/latitude pairs, into clustered locations. Figure 6 shows the location of the clusters on a normalized scale.

As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations are used as a categorical variable in our models.

Part II

With the necessary data pre-processing and variable transformations completed, the focus moved to the construction of models to predict WNV.
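Before turning to the models, the location clustering step from Part I can be sketched in Python. This is a plain implementation of Lloyd's k-means algorithm under stated assumptions: the paper used k = 20 on the trap coordinates, whereas the sketch uses k = 2 on six made-up coordinate pairs, and the initialization (a seeded random sample of the points) is an assumption.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on (latitude, longitude) pairs.
    Returns (centroids, labels); the label is the cluster index of each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up trap coordinates: three in the north, three in the south.
traps = [(41.99, -87.80), (41.98, -87.79), (42.00, -87.81),
         (41.70, -87.60), (41.71, -87.61), (41.69, -87.59)]
centroids, labels = kmeans(traps, k=2)
```

The resulting cluster index would then be treated as a categorical predictor rather than a number, in line with how the clustered locations are used in the models.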
The overall approach was to build an ensemble: a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers on which it is built. The strategy was to consider four individual algorithms and build the best possible classifier from each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle’s train dataset was split at random with probabilities of 70% and 30%: the 70% portion was used as the training set and the remaining 30% served as the hold-out test set.

Figure 7 summarizes the best set-up for each algorithm. Of the individual models, GAM was clearly the best performing, with an AUC of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
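The weighted-average ensemble and its AUC evaluation can be sketched as follows. The weights 0.6 and 0.4 come from the text above; everything else is an assumption for illustration: the labels and predicted probabilities are made up, the function names are hypothetical, and AUC is computed with the pairwise (Mann-Whitney) definition rather than with whatever tooling the authors used.

```python
def auc(labels, scores):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive case gets the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_ensemble(prob_a, prob_b, w_a=0.6, w_b=0.4):
    """Blend two classifiers' predicted probabilities with fixed weights."""
    return [w_a * a + w_b * b for a, b in zip(prob_a, prob_b)]

# Made-up hold-out labels and predicted WNV probabilities (not the paper's data).
y_true = [1, 0, 1, 0, 0, 1]
gam_prob = [0.90, 0.40, 0.35, 0.20, 0.50, 0.70]
svm_prob = [0.60, 0.30, 0.80, 0.40, 0.20, 0.30]

blend = weighted_ensemble(gam_prob, svm_prob)
auc_gam = auc(y_true, gam_prob)
auc_svm = auc(y_true, svm_prob)
auc_blend = auc(y_true, blend)
```

In this toy example each classifier misranks a different case, so the blend ranks the cases better than either one alone. That is the intuition behind the ensemble, although the gain on real data depends on how correlated the classifiers' errors are.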
CONCLUSION

Although the ensemble model had the highest AUC achieved on the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard.

In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were an ensemble of GAM logistic regression and GLM logistic regression, and a slightly modified Poisson GLM model. Neither had a notable training AUC, but both performed well on Kaggle.

Other validation techniques were investigated in an attempt to obtain better feedback from the training process and thereby build a better model. Instead of a 70/30 training and testing split, a modified version of n-fold cross validation was used in which one year’s data was left out for testing and the remaining years were used for training. This process was repeated four times, once for each year, and the model’s performance was averaged across the four runs. The best models obtained with this validation technique did not seem any different from the models built on a traditional 70/30 split.

Figure 8: Models & Imbalance

Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see whether the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and their relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance.
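The leave-one-year-out validation described in this section can be sketched in Python. The fitting and scoring functions below are placeholders, not the models from the paper: the "model" simply predicts the training-set WNV rate, and the score is the Brier score (lower is better). The records are made up.

```python
def leave_one_year_out(records, fit, score):
    """Hold out each year in turn, fit on the remaining years,
    and average the per-year scores."""
    years = sorted({r["year"] for r in records})
    per_year = {}
    for held_out in years:
        train = [r for r in records if r["year"] != held_out]
        test = [r for r in records if r["year"] == held_out]
        per_year[held_out] = score(fit(train), test)
    mean_score = sum(per_year.values()) / len(per_year)
    return per_year, mean_score

def fit_base_rate(train):
    """Placeholder model: the share of WNV-positive records in the training years."""
    return sum(r["wnv"] for r in train) / len(train)

def brier(rate, test):
    """Mean squared difference between the predicted rate and the outcomes."""
    return sum((rate - r["wnv"]) ** 2 for r in test) / len(test)

# Made-up outcomes for the four training years.
toy = {2007: [1, 0, 0, 0], 2009: [0, 0, 0, 0],
       2011: [1, 0, 0, 0], 2013: [1, 1, 0, 0]}
records = [{"year": year, "wnv": w} for year, ws in toy.items() for w in ws]

per_year, mean_score = leave_one_year_out(records, fit_base_rate, brier)
```

Splitting by year keeps every record from the held-out season out of training, which mimics more closely than a random 70/30 split the situation where the test set covers years the model has never seen.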
If the choice of validation technique does not account for the disparity between the training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and the testing data.

Specifically, it is possible that there are idiosyncratic inter-annual variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al (2015) note that populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year.

It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data.

This intuitively makes sense, as most of the models that performed better on Kaggle tend to be simple models that included variables like location, week number and mosquito species, which generalize across all years of the data.
Another matter to consider for future model building is the importance of the spray data. Though the spray data is not a part of the testing dataset, which would warrant its immediate dismissal from the predictor selection process, the following heat map implies otherwise.

Upon close inspection of the heat map, one speculates that spraying in one year does indeed alter the mosquito population the next year, which might explain why mosquito populations appear in different locations each year.

Also, feature engineering of the predictor variable depart [departure from normal] might help in creating a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from normal temperature as hotter than normal or colder than normal.
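The proposed categorization of depart could look like the following Python sketch. The category labels, the optional tolerance band and the handling of unparseable values are assumptions made for illustration; the paper only suggests the hotter/colder split.

```python
import math

def categorize_depart(raw, tolerance=0.0):
    """Map a departure-from-normal reading to a category label.
    `tolerance` (an assumed parameter) widens the band treated as normal."""
    try:
        depart = float(raw)
    except (TypeError, ValueError):
        return "unknown"  # value could not be read as a number
    if math.isnan(depart):
        return "unknown"
    if depart > tolerance:
        return "hotter_than_normal"
    if depart < -tolerance:
        return "colder_than_normal"
    return "normal"

# Made-up readings, including a non-numeric placeholder for a missing value.
readings = [4, "-3", 0, "M", None]
categories = [categorize_depart(r) for r in readings]
```

A tolerance above zero would keep small departures from being labelled as hot or cold days, which may be useful if only pronounced anomalies are expected to matter.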
Appendix

Table 3: Data Fields

Number  Train              Weather                 Spray      Test
1       Date               Station                 Date       ID
2       Address            Date                    Time       Date
3       Species            Max Temperature         Latitude   Address
4       Block              Min Temperature         Longitude  Species
5       Street             Avg Temperature                    Block
6       Trap               Departure from Normal              Street
7       Address Number     Dew Point                          Trap
8       Latitude           Wet Bulb                           Address Number
9       Longitude          Heat                               Latitude
10      Address Accuracy   Cool                               Longitude
11      # of Mosquitoes    Sunrise                            Address Accuracy
12      Wnvpresent         Sunset
13                         Code Sum
14                         Depth
15                         Water1
16                         Snowfall
17                         Total Precipitation
18                         Station Pressure
19                         Sea Level
20                         Wind Speed
21                         Wind Direction
22                         Average Speed
TABLE 2 | SPECIES 2

SKEWNESS OF VARIABLES & OUTLIERS

DATE PATTERN

The data is skewed to the left. There are more records for 2007 than other years but not by a significant amount. If this becomes problematic, we may sample an equal number of records for each year.

LATITUDE PATTERN

Shape: Latitude is very slightly skewed to the left. The mean is less than the median.
Center: 41.84628
Spread: 41.64461 to 42.01743

LONGITUDE PATTERN

Shape: Longitude is symmetric.
Center: -87.69499
Spread: -87.93099 to -87.53163

NUMBER OF MOSQUITOS PATTERN

Shape: The distribution is right skewed, as the mean (12.85351) is pulled to the right, away from the median, which is 5.
Center: 5
Spread: 1 to 50
Outlier: The boxplot confirms the skewness of the histogram in that there are large numbers causing the distribution to be pulled to the right. The outlier function indicates the largest number in the data for number of mosquitos is 50.

DISTANCE FROM O’HARE PATTERN

Shape: The distribution is symmetric.
Center: 0.2943334
Spread: 0.0372549 to 0.5179756

DISTANCE FROM MIDWAY PATTERN

Shape: The distribution is slightly skewed to the left, as the mean (0.1548598) is pulled away from the median (0.1616137).
Center: 0.1616137
Spread: 0.0077139 to 0.2481943
MAXIMUM TEMPERATURE PATTERN

Shape: The distribution is skewed to the left, as the mean (81.94765) is pulled to the left of the median (83).
Center: 83
Spread: 57 to 97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 57 is the point that is distant from the other values in the dataset.

MINIMUM TEMPERATURE PATTERN

Shape: The distribution is skewed to the left, as the mean (64.16533) is pulled to the left of the median (66).
Center: 66
Spread: 41 to 79
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 41 is the point that is distant from the other values in the dataset.

AVERAGE TEMPERATURE PATTERN

Shape: The distribution is skewed to the left, as the mean (38.28412) is pulled to the left of the median (40).
Center: 40
Spread: 15 to 52
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 15 is the point that is distant from the other values in the dataset.

TOTAL PRECIPITATION PATTERN

Shape: The distribution is skewed to the right, as the mean (0.1274281) is pulled to the right of the median (0).
Center: 0
Spread: 0.00 to 3.97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 3.97 is the point that is distant from the other values in the dataset.

RESULT OF WIND SPEED PATTERN

Shape: The distribution is skewed to the right, as the mean (5.911003) is pulled to the right of the median (5.5).
Center: 5.5
Spread: 0.1 to 15.4
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 15.4 is the point that is distant from the other values in the dataset.

RESULT OF WIND DIRECTION PATTERN

Shape: The distribution is skewed to the left, as the mean (17.72016) is pulled to the left of the median (19).
Center: 19
Spread: 1 to 36

AVERAGE WIND SPEED PATTERN

Shape: The distribution is skewed to the left, as the mean (123.4147) is pulled to the left of the median (139).
Center: 139
Spread: 3 to 177
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 3 is the point that is distant from the other values in the dataset.
TEMPERATURE MOVING AVERAGES - 1 WEEK PATTERN

Shape: The distribution is skewed to the left, as the mean (72.5431) is pulled to the left of the median (73.14286).
Center: 73.14286
Spread: 53.14286 to 83.85714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 53.14286 is the point that is distant from the other values in the dataset.

TEMPERATURE MOVING AVERAGES - 2 WEEK PATTERN

Shape: The distribution is skewed to the left, as the mean (72.41439) is pulled to the left of the median (73).
Center: 73
Spread: 55.07143 to 82.76923
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 55.07143 is the point that is distant from the other values in the dataset.

MOVING AVGS OF PRECIPITATION - 1 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.1333564) is pulled to the right of the median (0.07).
Center: 0.07
Spread: -0.0000 to 1.42857
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 1.42857 is the point that is distant from the other values in the dataset.

MOVING AVGS OF PRECIPITATION - 2 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.130) is pulled to the right of the median (0.085).
Center: 0.085
Spread: 0.0007 to 0.76714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 0.76714 is the point that is distant from the other values in the dataset.

MOVING SUM OF PRECIPITATION - 1 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (0.9432334) is pulled to the right of the median (0.53).
Center: 0.53
Spread: -0.000 to 9.149
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 9.15 is the point that is distant from the other values in the dataset.

MOVING SUM OF PRECIPITATION - 2 WEEK PATTERN

Shape: The distribution is skewed to the right, as the mean (1.74216) is pulled to the right of the median (1.1).
Center: 1.1
Spread: -0.000 to 10.74999
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 10.75 is the point that is distant from the other values in the dataset.

DEGREE DAY PATTERN

Shape: The distribution is skewed to the right, as the mean (3.824472) is pulled to the right of the median (3.4).
Center: 3.4
Spread: 0.0 to 14.9

ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN

Shape: The distribution is skewed to the right, as the mean (241.0934) is pulled to the right of the median (239.6).
Center: 239.6
Spread: 1.3 to 521.1

LOWER & UPPER BOUND OUTLIERS
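The lower and upper bounds behind this figure are not recoverable from the slide. As a hedged illustration of the checks used throughout the appendix, the Python sketch below compares mean and median to judge the direction of skew and flags outliers with Tukey's boxplot rule (1.5 times the interquartile range). The 1.5 multiplier, the function names and the sample counts are assumptions, not values taken from the paper.

```python
import statistics

def skew_direction(values):
    """Heuristic used in the appendix: the mean is pulled toward the long tail."""
    mean, median = statistics.mean(values), statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "symmetric"

def tukey_fences(values, k=1.5):
    """Lower and upper outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Made-up mosquito counts with one large value in the right tail.
counts = [1, 2, 3, 4, 5, 5, 6, 7, 8, 50]
direction = skew_direction(counts)
lower, upper = tukey_fences(counts)
flagged = find_outliers(counts)
```

Comparing mean and median gives only the direction of skew, not its size, so a formal skewness statistic would be needed to rank variables by how skewed they are.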
GROUPED LINE GRAPH | YEAR 2007

Blue line: The average precipitation.
Red line: The average number of mosquitos.
Green line: The average temperature.
Purple line: The presence of virus.

GROUPED LINE GRAPH | YEAR 2009

Blue line: The average precipitation.
Red line: The average number of mosquitos.
Green line: The average temperature.
Purple line: The presence of virus.

GROUPED LINE GRAPH | YEAR 2011

Blue line: The average precipitation.
Red line: The average number of mosquitos.
Green line: The average temperature.
Purple line: The presence of virus.

GROUPED LINE GRAPH | YEAR 2013

Blue line: The average precipitation.
Red line: The average number of mosquitos.
Green line: The average temperature.
Purple line: The presence of virus.
Works Cited

Abdi, Herve. Multivariate analysis. Retrieved from www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf

Andridge & Little. (2011). A review of hot deck imputation for survey non-response. Int Stat Rev. 78(1): 40-64. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/

Ezanno, P., Aubry-Kientz, M. et al. (2015). A generic weather driven model to predict mosquito population dynamics applied to species of Anopheles, Culex and Aedes genera of southern France. 120(1): 39-50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972

Kaggle. West Nile Prediction. Retrieved from https://www.kaggle.com/c/predict-west-nile-virus/data

Kuhn & Johnson. (2013). Applied Predictive Modeling. New York, Springer.

Natural Resources Management and Environmental Departments. Annex 4: Statistical Analysis of Weather Data Sets 1. Retrieved from http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage

Ruiz, Marilyn O., F Chavez Luis et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors. Retrieved from http://www.parasitesandvectors.com/content/3/1/19

Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus illness and urban landscapes in Chicago and Detroit. International Journal of Health Geographics.

Sim, C.H., Gan, F.F. et al. (2005). Outlier labeling with boxplot procedures. Journal of the American Statistical Association, 100(470). Retrieved from http://www.jstor.org/stable/27590584

Theophilides, C.N., S.C. Ahearn et al. (2006). First evidence of West Nile virus amplification and relationship to human infections. International Journal of Geographical Information Science, 20, 103-115.