This document discusses techniques for handling missing data in statistical analysis and modeling. It compares different modeling approaches on three datasets - one on shoe preferences from a stated preference survey, one on diabetes risk factors, and one on homeowner characteristics. It finds that classification and regression tree (CART) and multivariate adaptive regression splines (MARS) techniques are preferred for imputing missing values when the data contains mixed variable types and interactions among variables. CART can sequentially impute missing values for each variable while preserving the multivariate structure of the data.
1. Ingo Bentrott
School of Marketing
University of Technology, Sydney
2. “Vinod Shetty, of Mumbai, secretary of the newly formed
Young Professionals Collective, said staff were subject to so
much abuse that thousands of its workers were quitting in
despair. The problem has become so bad that remaining
workers are being forced to extend their shifts to 12 to 13
hours a day to fill the gaps.
Although a call centre worker in India earns about $70 a week-
twice as much as most professionals in a nation suffering
chronic underemployment- up to 60 per cent leave their jobs
each year.”
3. (insert graph and logistic regression)
If you run a logistic regression for BUY using
the data on the left, you will get a response
like the graphic on the right.
This is due to the well-known issue of
Listwise Deletion (LD): any case with a missing value is dropped before estimation.
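To make the effect of listwise deletion concrete, here is a minimal sketch in plain Python. The records and field names are hypothetical stand-ins for the BUY example, not the data from the slide.

```python
# Hypothetical toy survey records; None marks an item non-response.
rows = [
    {"buy": 1, "age": 34, "income": 72, "user_rating": 4},
    {"buy": 0, "age": 51, "income": None, "user_rating": 2},
    {"buy": 1, "age": None, "income": 58, "user_rating": 5},
    {"buy": 0, "age": 45, "income": 40, "user_rating": None},
    {"buy": 1, "age": 29, "income": 65, "user_rating": 3},
]


def listwise_delete(records):
    """Keep only the records with no missing value in any field."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
share_kept = len(complete) / len(rows)
print(f"{len(complete)} of {len(rows)} rows survive ({share_kept:.0%})")
```

Each dropped row is missing only one item, yet three of the five respondents are lost to the model.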
4. There are two types of non-response: complete non-
response, where the person does not participate at all in
the survey and item non-response where a survey is only
partially completed
Coleman (1991) mentioned that the rates of non-
responses have remained constant but Jarvis (2002) says
the rates are increasing when you control for answering
machines.
Respondents have grown to “strongly dislike” phone
surveys.
◦ The primary concern is privacy, which has been made worse by
well-publicized breaches in security (Jarvis 2002)
In essence, whenever your data has missing values,
you are forced to address them somehow
◦ Delete or Impute
5. Missing data can be of three types
◦ Missing Completely at Random (MCAR)
Missings are unrelated to the value of x or any other
variable
◦ Missing at Random (MAR)
Missing is not a function of x when “controlled for other
variable effects.”
◦ Non-ignorable missing
Missing caused by an unmeasured variable
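The three mechanisms can be illustrated with a small simulation. This is a sketch under stated assumptions: the variables `x` and `z`, the missingness probabilities, and the choice to let the non-ignorable case depend on the unrecorded value of `x` itself are all illustrative and do not come from the datasets in this presentation.

```python
import random

random.seed(0)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]   # always observed
x = [zi + random.gauss(0, 1) for zi in z]    # the variable that gets holes


def observed_mean(values, missing):
    """Mean of the values that are still observed under a missingness mask."""
    kept = [v for v, m in zip(values, missing) if not m]
    return sum(kept) / len(kept)


# MCAR: every case has the same chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: the chance depends only on the observed variable z.
mar = [random.random() < (0.6 if zi > 0 else 0.1) for zi in z]
# Non-ignorable (assumption): the chance depends on the unseen x value.
nonignorable = [random.random() < (0.6 if xi > 0 else 0.1) for xi in x]

true_mean = sum(x) / n
for name, mask in [("MCAR", mcar), ("MAR", mar), ("non-ignorable", nonignorable)]:
    print(f"{name:14s} observed mean = {observed_mean(x, mask):+.3f}"
          f"  (full-data mean {true_mean:+.3f})")
```

Under MCAR the observed mean stays close to the full-data mean, while both other mechanisms pull it down. In the MAR case the shift can in principle be corrected using `z`; in the non-ignorable case nothing in the recorded data explains it.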
6. Most current discrete choice studies are using stated preference
designs
◦ Creates orthogonal Xs
This is a way to reduce the number of respondents by getting as
much data as possible out of fewer respondents
Discrete choice studies based on Random Utility Theory (RUT) can
give you excellent estimates of willingness to pay (WTP)
◦ Complete cases are necessary for low-variance estimation
If data is collected by the same survey instrument, it is likely to have the
same missing pattern across the Xs (Howell, 1998).
Revealed Preference (RP) data usually has multicollinearity issues
and the use of missing data indicators will exacerbate this issue.
7. (insert graph)
From our example a bit ago, using most
multiple imputation techniques would still
have problems imputing a value for USER
RATING above.
If the only variables that can be used are AGE,
INCOME and POST CODE, the imputed values would be
a linear combination of these
8. Many statistics packages use Listwise Deletion (LD) by default when
estimating a discrete choice model.
◦ In SEM models, VAR-COV matrix only uses valid data for
estimation
LD leads to selection bias and estimates with reduced efficiency
If data is MCAR, the only penalty is loss of power
Multiple Imputation (MI) makes multiple imputes of the same data point and
averages the results
◦ MI is a main-effects only model, CART/MARS use interactions so
we may not need multiple imputes
“Hot Deck” imputation (Little and Rubin, 1987) is a technique where
you fill in values taken from similar cases (similar to surrogates in
CART)
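A hot deck fill can be sketched in a few lines. The version below is only one simple variant: it picks the nearest complete donor by an unscaled absolute distance, and the function name `hot_deck`, the records, and the matching fields are hypothetical.

```python
def hot_deck(records, target, match_on):
    """Fill missing `target` values with the value of the closest complete donor."""
    donors = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is None:
            # Closest donor by summed absolute difference on the matching fields.
            donor = min(donors, key=lambda d: sum(abs(d[k] - r[k]) for k in match_on))
            r = {**r, target: donor[target]}
        filled.append(r)
    return filled


people = [
    {"age": 25, "income": 40, "user_rating": 5},
    {"age": 62, "income": 90, "user_rating": 2},
    {"age": 27, "income": 42, "user_rating": None},
]
filled = hot_deck(people, "user_rating", ["age", "income"])
print(filled[2])
```

The third respondent borrows the rating of the 25-year-old, the most similar complete case. In practice the matching variables would be put on a common scale first.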
9. Expectation Maximization (EM) has been successfully
applied to missing data but standard errors must be
obtained using auxiliary methods.
◦ Missing values are imputed during EM
FIML and ML methods assume multivariate normality
◦ These techniques are best when there are a few, distinct
patterns of missing data (Little, Schnabel, Baumert, 2000).
If the data is MAR and not MCAR all the above
techniques will be biased
◦ Since MAR implies another “observed” explanatory variable
is affecting the missing, interactions in CART/MARS can
pick this up.
10. Most missing data tends to act in combination (Borgoni
and Berrington, 2004)
We should not try to “break” the multivariate nature of the
data.
◦ CART uses surrogates, so even though we impute data one
variable at a time, the structure will be preserved.
Most imputation techniques assume multivariate normality.
Imputation sometimes assumes data is MCAR, but if the
data has a high degree of interactions and is non-monotonic,
CART, by its nature, will perform better on data that is MAR
EM algorithm has been proven to be good but handles
missings only during estimation
◦ CART technique can fill the dataset for later analysis.
11. If data has high dimensionality and data sparseness, the univariate
nature of CART will be better able to handle this than Multiple
Imputation using regression.
Trees are also less sensitive to outliers and misspecified models
Although a multiple iteration tree, which uses multiple draws from CART's
conditional distribution, is shown to be better in Monte Carlo studies
(Borgoni and Berrington, 2004), the results are within a
standard error of the “one shot” variable at a time CART imputation
technique.
◦ One shot has some added variability (like other techniques) but
standard errors may be underestimated.
◦ Extra information gathered from imputation may offset extra
variability
If the data is MCAR, a simple Pearson chi-square test of
Observed versus Expected values can be used to validate the imputed values.
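As a sketch of that check, the Pearson statistic can be computed by hand for one categorical variable. The counts below are hypothetical, and the expected counts are assumed to come from the category shares among respondents who answered the item.

```python
from collections import Counter


def pearson_chi_square(observed_counts, expected_counts):
    """Sum of (O - E)^2 / E over the categories."""
    return sum((observed_counts.get(k, 0) - e) ** 2 / e
               for k, e in expected_counts.items())


answered = ["renter"] * 60 + ["owner"] * 40   # hypothetical observed answers
imputed = ["renter"] * 14 + ["owner"] * 6     # hypothetical imputed answers

# Under MCAR the imputed values should follow the observed category shares.
shares = {k: c / len(answered) for k, c in Counter(answered).items()}
expected = {k: s * len(imputed) for k, s in shares.items()}

stat = pearson_chi_square(Counter(imputed), expected)
df = len(expected) - 1
print(f"chi-square = {stat:.3f} on {df} df (5% critical value for 1 df: 3.841)")
```

Here the statistic is well below the critical value, so the imputed distribution would not be flagged as different from the observed one.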
12. (insert table of Descriptive Statistics)
The diagnostic, binary-valued variable investigated is whether
the patient shows signs of diabetes according to World Health
Organization criteria (i.e., if the 2 hour post-load plasma
glucose was at least 200 mg/dl at any survey examination or
if found during routine medical care). The population lives
near Phoenix, Arizona, USA.
13. (insert table of Descriptive Statistics)
This is a dataset with information about renters
and homeowners. The dataset is a good mixture of
categorical and continuous variables with a lot of
missing data.
14. This survey is aimed at gathering some
information about your preferences for athletic
shoes. More specifically, the product in question
is an athletic shoe that is to be used primarily for
playing a sport (or several sports). For example,
the shoes could be used for playing basketball,
tennis, running, hiking, and so on.
Since the questions asked are from a balanced
stated preference (SP) design, there are only
missing values in the demographic questions
16. This presentation looks at 5 different modeling techniques on
the 3 datasets mentioned previously.
Model 1. The first model was a simple logistic regression using
all variables
◦ No transformations
◦ Listwise deletion was used for missing values
Model 2. A MARS model was then run with main effects only and
all model defaults
◦ Since the data is binary, this is a Linear Probability Model (LPM)
Model 3. Mean imputation was used in a logit model
Model 4. MARS basis functions were then put into logistic
regression to recover standard errors and eliminate the need for
weighted least squares in LPM
17. Step 1. Sort the variables with missing values from least to
most missing
Step 2. Starting with the least missing variable, partition
the data into one data set with that variable's missing
values and one data set with complete cases
Step 3. Estimate a tree with the least missing variable as a
target
Step 4. Score the data set with missing values from the
results in step 3
Step 5. Repeat for the next affected variable until all data
is filled
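The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: it uses a tiny hand-written regression tree in place of CART, it has no surrogate splits (so only predictors that are currently complete are used), and the data and names are hypothetical.

```python
def sse(rows, target):
    """Sum of squared deviations of the target from its mean."""
    ys = [r[target] for r in rows]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(rows, target, predictors, depth=2, min_leaf=2):
    """Small regression tree: a leaf is a mean, a node is (var, cut, low, high)."""
    best = None
    if depth > 0:
        for p in predictors:
            for cut in sorted({r[p] for r in rows})[1:]:
                low = [r for r in rows if r[p] < cut]
                high = [r for r in rows if r[p] >= cut]
                if len(low) < min_leaf or len(high) < min_leaf:
                    continue
                score = sse(low, target) + sse(high, target)
                if best is None or score < best[0]:
                    best = (score, p, cut, low, high)
    if best is None:
        return sum(r[target] for r in rows) / len(rows)
    _, p, cut, low, high = best
    return (p, cut,
            fit_tree(low, target, predictors, depth - 1, min_leaf),
            fit_tree(high, target, predictors, depth - 1, min_leaf))


def predict(tree, row):
    while isinstance(tree, tuple):
        p, cut, low, high = tree
        tree = low if row[p] < cut else high
    return tree


def impute_sequentially(rows):
    rows = [dict(r) for r in rows]
    names = list(rows[0])
    n_missing = {v: sum(r[v] is None for r in rows) for v in names}
    # Step 1: order the affected variables from least to most missing.
    todo = sorted((v for v in names if n_missing[v] > 0), key=n_missing.get)
    for target in todo:  # Step 5: repeat until every variable is filled
        # Simplification: without surrogates, use only complete predictors.
        predictors = [v for v in names
                      if v != target and all(r[v] is not None for r in rows)]
        train = [r for r in rows if r[target] is not None]  # Step 2
        tree = fit_tree(train, target, predictors)          # Step 3
        for r in rows:                                      # Step 4
            if r[target] is None:
                r[target] = predict(tree, r)
    return rows


data = [
    {"age": 25, "income": 40, "rating": 5},
    {"age": 30, "income": 45, "rating": 5},
    {"age": 35, "income": None, "rating": 4},
    {"age": 50, "income": 80, "rating": 2},
    {"age": 55, "income": 85, "rating": None},
    {"age": 60, "income": 90, "rating": 2},
    {"age": 28, "income": 42, "rating": None},
    {"age": 58, "income": 88, "rating": 1},
]
filled = impute_sequentially(data)
for row in filled:
    print(row)
```

Because income is filled first, it is available as a predictor when rating is imputed. A categorical target would need a classification tree instead of the regression tree used here.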
18. (insert graph)
Regression by logit will yield a different shape
than a linear probability model
Some cases will be classified differently using
the same basis functions from MARS
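A small sketch shows why the two shapes differ even when both models use the same MARS-style hinge basis function. The knot and all coefficients below are hypothetical values chosen for illustration, not estimates from any of the three datasets.

```python
import math


def hinge(x, knot):
    """MARS-style basis function max(0, x - knot)."""
    return max(0.0, x - knot)


def lpm(age):
    """Linear probability model on the basis function (hypothetical coefficients)."""
    return 0.10 + 0.04 * hinge(age, 30)


def logistic(age):
    """Logit model on the same basis function (hypothetical coefficients)."""
    return 1 / (1 + math.exp(-(-2.0 + 0.15 * hinge(age, 30))))


for age in (20, 42, 60):
    print(age, round(lpm(age), 3), round(logistic(age), 3))
```

At age 60 the linear probability model returns a value above 1, while the logit stays inside (0, 1). At age 42 the two models fall on opposite sides of a 0.5 cutoff, so the same case is classified differently.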
21. The data on Shoe buyers is “real” in that it was an
SP study that was deployed
The nature of orthogonal design forced trade-offs
and controls for interactions
The Pima Indian and Home Owner datasets are
well known and have well-defined patterns
amongst the Xs
If the buyers are the class of interest, a
CART/MARS imputation is clearly preferred
22. CART and MARS will perform better on mixed data types and
should be the preferred imputation modeling technique
◦ Possible CART MARS Logit technique to capture all possible non-
monotonics
Web-based surveys allow us to see when people quit the survey
Can investigate if the person looked at all questions and refused
some
◦ In mail surveys, this is impossible
◦ The web will expand our missing data categories, since a complete survey
then means someone who viewed and answered all the questions (Bosnjak and
Tuten, 2001)
Paying survey respondents still works best for reducing
non-response
◦ CART can be used with ROC/lift charts to see what the optimal amount of
payment per completed survey is
◦ Many companies would be willing to pay for this completeness (Coleman,
1991)