D1S1T3N4_Pratibha Jalui & Reetabrata Bhattacharyya

Imputation of Missing Data through Bayesian Approach
Pratibha Jalui
Cytel Statistical Software & Services Pvt. Ltd, Pune
Email: pratibha.jalui@cytel.com
Reetabrata Bhattacharyya
Tata Consultancy Services Limited, Mumbai
Email: reetabrata.b@tcs.com
Pratibha (Cytel) & Reetabrata (TCS), Imputation of Missing Data through Bayesian Approach, ConSPIC, 8th - 10th Oct, 2015

Overview
1 Introduction and Background
2 Mechanisms
3 Motivation
4 Objective
5 Data and Methods
6 Results
7 Conclusion and Discussion

Why talk about Missing Data?
Randomized clinical trials - primary tool for
evaluating new medical interventions.
More than $7 billion is spent every year on evaluating
drugs, devices, and biologics, yet a substantial
percentage of the outcomes of interest is often missing.
Missingness reduces the benefit provided by
randomization - it introduces potential biases in the
comparison of the treatment groups.
As many as 65% of articles in PubMed journals do
not report how missing data were handled.
Health authorities encourage better approaches to
handling missing data.
"The only really good solution to the missing data problem is not to have any" - Paul Allison

How do we define Missing Data?
Missing Data
Data that were planned to be recorded but are not available.
There are broadly two types of missing data:
Monotone missing data
All data for a subject are missing after a certain time-point.
Serious problem in interpreting the results of a trial.
Non-monotone or intermediate missing data
A subject misses a visit but contributes data at later visits.
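
The two patterns can be told apart mechanically. The following is a minimal sketch, assuming one subject's visits are held in a list with `None` marking a missing value; the function name `missing_pattern` is hypothetical and not part of the original slides.

```python
def missing_pattern(visits):
    """Classify one subject's visit record; None marks a missing value."""
    missing = [v is None for v in visits]
    if not any(missing):
        return "complete"
    first_gap = missing.index(True)
    # Monotone: once a value is missing, every later value is missing too.
    if all(missing[first_gap:]):
        return "monotone"
    # Otherwise the subject came back after a missed visit.
    return "non-monotone"
```

For example, `missing_pattern([5, 4, None, None])` returns `"monotone"`, while `missing_pattern([5, None, 4, 3])` returns `"non-monotone"`.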

Types of Missingness
1. Missing Completely at Random (MCAR)
Missingness is independent of both observed and unobserved data.
Example:
• A patient moves to another city for non-health reasons. Patients who drop
out of a study for this reason can be considered a random and
representative sample of the total study population.
2. Missing at Random (MAR)
Given the observed data, missingness does not depend on the unobserved data.
Example:
• Dropout due to a previously recorded lack of efficacy could be MAR, because it is
in some sense predictable from the observed data in the model.
• Men may be more likely than women to decline to answer some questions.

Types of Missingness
3. Missing Not At Random (MNAR)
Missingness depends on the unobserved data, even after
accounting for the observed data.
Difficult to model.
Example:
• After a series of visits with good outcomes, a patient may
drop out due to lack of efficacy. In this situation the analysis model based
on the observed data, including relevant covariates, is likely to continue to
predict a good outcome, but it is usually unreasonable to expect the patient
to continue to derive benefit from treatment.
• Individuals with very high incomes are more likely to decline to answer
questions about their own income.
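
The three mechanisms can also be written formally. The block below is a sketch in the notation of Rubin (1976), which the slides do not spell out: $R$ is the missingness indicator, $Y_{obs}$ and $Y_{mis}$ are the observed and missing parts of the data, and $\psi$ denotes the parameters of the missingness process (all symbols here are assumptions introduced for illustration).

```latex
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{obs}, Y_{mis}, \psi) = P(R \mid \psi) \\
\text{MAR:}\quad  & P(R \mid Y_{obs}, Y_{mis}, \psi) = P(R \mid Y_{obs}, \psi) \\
\text{MNAR:}\quad & P(R \mid Y_{obs}, Y_{mis}, \psi) \ \text{still depends on } Y_{mis} \text{ after conditioning on } Y_{obs}
\end{aligned}
```

MCAR is the special case of MAR in which the dependence on $Y_{obs}$ also vanishes.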

The Effect of Missing Values on Analysis and Interpretation
The following problems may affect the interpretation of the trial results when
some data are missing.
Power and Variability
• The power of a trial increases with the sample size and decreases with the
variability of the outcomes. Missing data reduce the number of analysable
observations and therefore reduce power.
Bias
• The risk of bias in the estimate of the treatment effect from the observed
data depends on the relationship between missingness, treatment and
outcome.
• The type of bias that can critically affect interpretation depends on
whether the objective of the study is to show a difference or to demonstrate
non-inferiority/equivalence.

Goals of Statistical Analysis with Missing Data
Goals of Statistical Analysis:
Minimize bias
Maximize use of available information
Obtain appropriate estimates of uncertainty
Key points to keep in mind:
Research question (i.e. the hypothesis under investigation)
Information in the observed data
Reason(s) for missing data
As statisticians/programmers we need to:
Consult with investigators on a design that minimizes missing data/information,
postulate plausible missingness mechanisms, perform a valid analysis and
interpret the results.

What do the Regulatory Bodies (FDA/EMEA) recommend?
Avoid Missing Data wherever possible
Protocol to address potential impact and treatment of anticipated missing
data
Design strategies to minimize treatment and analysis dropouts
Continue to collect information on key outcomes from participants who
discontinue - record it and use it in the analysis
Set a minimum rate of completeness for the primary outcome(s), based
on similar past trials
Specify the statistical methods and assumptions for handling missing data in
the protocol in a way that clinicians can understand
Focus efforts on training staff

What do the Regulatory Bodies (FDA/EMEA) recommend?
Avoid single imputation methods such as LOCF and BOCF as the primary
approach to the treatment of missing data unless the underlying assumptions
are scientifically justified.
Use parametric models, including random effects models, with caution: state all
assumptions clearly and accompany them with goodness-of-fit procedures.
Weighted generalized estimating equations should be more widely
used as an alternative to parametric modeling.
When substantial missing data are anticipated, auxiliary information
should be collected.
Sensitivity analyses are mandated as part of the primary reporting of findings
from clinical trials.

Treatments for Missing Data: Traditional Approach
Listwise Deletion
• Omit cases with missing data and run the analyses on what remains.
Simple Imputation Method - Last Observation Carried Forward (LOCF)
• A subject's missing responses are set equal to their last observed response; the
method was developed under the Missing Completely At Random (MCAR) framework.
• Usually used in longitudinal (repeated measures) studies of continuous
outcomes.
Simple Imputation Method - Baseline Observation Carried Forward (BOCF)
• Similar to LOCF, but a patient's missing responses are set
equal to their observed baseline response.
Empirically developed models
• Unconditional and conditional mean imputation
• Best or worst case imputation
• Regression methods and hot-deck imputation
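
The two carry-forward rules are simple enough to state in code. The following is a minimal sketch, assuming one subject's scores are held in a list ordered by visit with the baseline first and `None` marking a missing value; the function names `locf` and `bocf` are hypothetical helpers, not the SAS code used for the analysis in these slides.

```python
def locf(visits):
    """Replace each missing value (None) with the last observed value."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        # A gap before any observed value stays None: nothing to carry forward.
        filled.append(last)
    return filled


def bocf(visits):
    """Replace each missing value with the baseline (first) value."""
    baseline = visits[0]
    return [baseline if value is None else value for value in visits]
```

For the record `[5, 4, None, 3, None, None]`, `locf` gives `[5, 4, 4, 3, 3, 3]` and `bocf` gives `[5, 4, 5, 3, 5, 5]`. Both fill each gap with a single fixed value, so the analysis treats imputed values as if they had been observed.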

Treatments for Missing Data: Modern Approach
Full Information Maximum Likelihood (FIML) model
• A pragmatic approach to estimation with missing data, used in structural equation
modeling.
• Produces unbiased parameter estimates and standard errors under MAR
and MCAR.
• Unlike maximum likelihood applied to complete cases only, FIML uses all available
information in all observations.
Mixed-Effect Model Repeated Measure (MMRM) model
• Fitted by restricted maximum likelihood to analyse
longitudinal (repeated measures) data under the MAR assumption.
• Missing data are not explicitly imputed. A missing visit has no effect on the other
scores from that same patient.

Objective
1 To examine the multiple imputation (MI) approach, specifically the Bayesian
Markov Chain Monte Carlo (MCMC) random sampling method, for the
analysis of incomplete data.
2 To compare the analysis of the original data and of data imputed by last
observation carried forward (LOCF) and baseline observation carried
forward (BOCF) with MI through the Bayesian
MCMC random sampling method.

Data : Analytical Background
Testing of treatment (Hypothesis of Interest)
To evaluate the efficacy of Treatment A at Week-16 for change in
Vitreous Haze (VH) score.
Statistical Analysis Plan
The change from baseline to Week 16 in VH score is compared
between treatment groups using an Analysis of Covariance (ANCOVA)
model.
The model includes the fixed categorical effects of treatment group,
visit and treatment-by-visit interaction, as well as the fixed continuous
covariate of baseline VH.
The model provides adjusted least-squares (LS) mean estimates at Week
16 for both treatment groups, the difference between the means, and the
corresponding standard error (SE), confidence interval (CI) and p-value.
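
Written out, the model described above has the following form. This is a sketch under assumed notation, none of which appears in the original slides: $Y_{ij}$ is the change from baseline in VH score for patient $i$ at visit $j$, $g(i)$ is the treatment group of patient $i$, and $x_i$ is the baseline VH score. The error covariance structure is not stated in the slides and is left unspecified here.

```latex
Y_{ij} = \mu + \tau_{g(i)} + \nu_j + (\tau\nu)_{g(i)\,j} + \beta x_i + \varepsilon_{ij}
```

Here $\tau$ is the treatment effect, $\nu_j$ the visit effect, $(\tau\nu)$ the treatment-by-visit interaction and $\beta$ the slope on baseline VH. The LS means at Week 16 are the model predictions for each treatment group at that visit, evaluated at the mean baseline score.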

Data: Simulation
A hypothetical clinical trial efficacy dataset is simulated as input
for the MCMC method of missing data imputation.
100 patients are considered, with an amount of missing data similar to that
observed in our real data set.
The missing data pattern is created at random.
This is not an exhaustive simulation study; it only demonstrates the application
of the Bayesian method for imputing missing values.
The data set is simulated to obtain a more complete comparison of
the three methods (BOCF, LOCF and MI).
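
A randomly created missing pattern of this kind could be generated as follows. This is only an illustrative sketch, not the authors' simulation: the function name `simulate_scores`, the uniform integer scores from 1 to 8, the missingness probability `p_missing=0.2` and the seed are all assumptions chosen for the example.

```python
import random


def simulate_scores(n_patients=100, n_visits=9, p_missing=0.2, seed=2015):
    """Return one list per patient: baseline first, then post-baseline visits.

    Scores are integers from 1 to 8; None marks a missing value.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_patients):
        scores = [rng.randint(1, 8) for _ in range(n_visits)]
        # Baseline (index 0) is always kept; each later visit is set to
        # missing independently with probability p_missing (an MCAR pattern).
        for j in range(1, n_visits):
            if rng.random() < p_missing:
                scores[j] = None
        data.append(scores)
    return data
```

Because each value is deleted independently of every score, the resulting pattern is MCAR and generally non-monotone.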

Methods: Analytical Background
Bayesian Approach
In Bayesian inference, information about unknown parameters is
expressed in the form of a posterior probability distribution.
Markov Chain Monte Carlo (MCMC)
A Markov chain is a sequence of random variables in which the
distribution of each element depends on the value of the previous one.
Through MCMC, we can simulate the entire joint posterior distribution
of the unknown quantities and obtain simulation based estimates of
posterior parameters of interest.
MCMC is a collection of methods for simulating random draws from
nonstandard distributions via Markov chains.
By repeatedly simulating steps of the chain, it simulates draws from the
distribution of interest.
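
To make the idea of "simulating steps of the chain" concrete, here is a minimal sketch of one MCMC method, a random-walk Metropolis sampler. It is chosen only because it is short; it is not the data augmentation sampler described later in the slides, and the names `metropolis` and `log_std_normal` are hypothetical.

```python
import math
import random


def metropolis(log_density, start, n_draws, step=1.0, seed=1):
    """Random-walk Metropolis: each draw depends only on the previous one."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        chain.append(x)
    return chain


def log_std_normal(x):
    """Log density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x
```

Calling `metropolis(log_std_normal, 0.0, 20000)` produces a chain whose values, after an initial stretch, behave like (correlated) draws from the standard normal distribution.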

Method: Data Augmentation (DA) Algorithm
Goal
To have the iterates converge to the stationary distribution.
To simulate an approximately independent draw of the missing values.
Assumption
The data are assumed to come from a multivariate normal distribution.
Data augmentation is applied to Bayesian inference with missing data by
repeating the following steps:
Step - 1
The imputation I-step:
Uses the current estimates of the mean vector and covariance matrix.
The I-step simulates the missing values for each observation independently.
The I-step draws values for $Y_{i(mis)}$ from the conditional distribution of $Y_{i(mis)}$
given $Y_{i(obs)}$,
where $Y_{i(mis)}$ denotes the variables with missing values for observation $i$ and
$Y_{i(obs)}$ the variables with observed values for observation $i$.

Method: Data Augmentation (DA) Algorithm
Step - 2
The posterior P-step:
The P-step simulates the posterior population mean vector and covariance
matrix from the complete-sample estimates, using a non-informative
prior.
These new estimates are then used in the next I-step.
The two steps are iterated until the chain converges to its stationary distribution,
which then yields an approximately independent draw of the missing values.
Summary
Current parameter estimate $\theta^{(t)}$ at the $t$-th iteration.
I-step draws $Y_{mis}^{(t+1)}$ from $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$.
P-step draws $\theta^{(t+1)}$ from $P(\theta \mid Y_{obs}, Y_{mis}^{(t+1)})$.
This creates a Markov chain $(Y_{mis}^{(1)}, \theta^{(1)}), (Y_{mis}^{(2)}, \theta^{(2)}), \ldots$
It converges in distribution to $P(Y_{mis}, \theta \mid Y_{obs})$.
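
The alternation of I-step and P-step can be sketched in a few lines. To keep the example short it assumes a single normally distributed variable rather than the multivariate normal model on the slides, with the non-informative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$. The function name `data_augmentation`, the iteration counts and the seed are assumptions for illustration; this is not the algorithm as implemented in SAS.

```python
import random
import statistics


def data_augmentation(y, n_iter=500, burn_in=100, seed=27160):
    """Alternate I-steps and P-steps for a univariate normal sample.

    `y` is a list of floats with None marking missing values.
    Returns a list of (mu, sigma2, completed_data) tuples after burn-in.
    """
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n = len(y)
    # Start the chain at the observed-data estimates.
    mu = statistics.fmean(observed)
    sigma2 = statistics.variance(observed)
    draws = []
    for t in range(n_iter):
        # I-step: draw each missing value from P(Y_mis | Y_obs, theta).
        sd = sigma2 ** 0.5
        completed = [v if v is not None else rng.gauss(mu, sd) for v in y]
        # P-step: draw theta from P(theta | Y_obs, Y_mis) given the
        # completed data, under the prior proportional to 1 / sigma2.
        ybar = statistics.fmean(completed)
        ss = sum((v - ybar) ** 2 for v in completed)
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)  # chi-square, n - 1 df
        sigma2 = ss / chi2
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        if t >= burn_in:
            draws.append((mu, sigma2, completed))
    return draws
```

Each retained `completed` list is one candidate imputed data set. In practice the retained draws would be spaced far enough apart in the chain to be approximately independent, which this sketch does not attempt.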

Method: Application in SAS
Multiple Imputation step 1
MCMC method used in conjunction with the IMPUTE=MONOTONE
option to create an imputed data set with a monotone missing pattern.
Variables include treatment group and VH scores at baseline and
post-baseline analysis visits.
This method implies that VH scores are analysed as continuous variables
and treatment group is a dummy variable.
SAS Code
/* Step 1: MCMC imputes only the values needed to make the pattern monotone */
proc mi data=dset1 out=MIstep1 seed=27160 nimpute=1000 noprint;
  mcmc impute=monotone chain=multiple;
  var armn baseline week2 week4 week6 week8 week10
      week12 week14 week16;
run;

Method: Application in SAS
Multiple Imputation step 2
Missing data are imputed with a regression method by using the
monotone data set from step 1
Variables include treatment group, stratification variables and VH scores
at baseline and post-baseline analysis visits.
This method implies that VH scores are analysed as continuous variables.
Output data set from step 1 (after rounding) is used as input data set for
step 2.
Only 1 imputation in step 2 (for each imputation from step 1).
SAS Code
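/* Step 2: regression imputation of the remaining monotone missing values, */
/* run once within each step-1 imputation (input sorted by _Imputation_)   */
proc mi data=MIstep1r out=MIstep2 seed=54320 nimpute=1 noprint;
  by _Imputation_;
  class armn stratum;
  var armn stratum baseline week2 week4 week6 week8 week10
      week12 week14 week16;
  monotone reg;
run;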
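
After step 2 the analysis model is fitted to each imputed data set and the results are combined. The slides do not show this step; the following is a minimal sketch of the standard combining rules from Rubin (1987), written in Python rather than SAS for illustration. The function name `pool_rubin` is hypothetical, and the degrees-of-freedom adjustment (see Barnard and Rubin, 1999) is omitted.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # mean within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, total
```

The pooled standard error is the square root of the returned total variance. The between-imputation term is what single imputation methods such as LOCF and BOCF leave out.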

Result: Tabular representation of efficacy endpoint
Table 1: Change from baseline in VH score to Week 16, MITT population

Vitreous Haze (Miami 9-step scale)       Placebo (N=43)       Treatment A (N=57)
Baseline
  Number                                 43                   57
  Mean (SD)                              4.47 (1.96)          4.68 (2.49)
  Median                                 5.00                 5.00
  Min : Max                              1.0 : 8.0            1.0 : 8.0
Week 16
  Number                                 28                   44
  Mean (SD)                              4.18 (2.47)          4.64 (2.30)
  Median                                 3.50                 5.00
  Min : Max                              1.0 : 8.0            1.0 : 8.0
Change from Baseline
  Number                                 28                   44
  Mean (SD)                              -0.11 (3.58)         -0.25 (3.36)
  Median                                 0.50                 -1.00
  Min : Max                              -7.0 : 6.0           -7.0 : 6.0
Analysis: Original Data
  LS Means (SE)                          -0.42 (0.414)        0.06 (0.331)
  90% CI                                 (-1.103 to 0.262)    (-0.485 to 0.604)
  LS Mean difference (SE) vs. Placebo                         0.48 (0.530)
  90% CI of difference                                        (-0.393 to 1.354)
  p-value                                                     0.3653
Analysis: BOCF
  LS Means (SE)                          -0.18 (0.340)        -0.10 (0.295)
  90% CI                                 (-0.739 to 0.380)    (-0.590 to 0.383)
  90% CI of difference                                        (-0.665 to 0.817)
  p-value                                                     0.8659
Analysis: LOCF
  LS Means (SE)                          0.15 (0.329)         0.15 (0.329)
  90% CI                                 (-0.395 to 0.689)    (-0.197 to 0.745)
  LS Mean difference (SE) vs. Placebo                         0.13 (0.436)
  90% CI of difference                                        (-0.591 to 0.845)
  p-value                                                     0.7704
Analysis: Imputation (Bayesian)
  LS Means (SE)                          -0.78 (0.380)        -0.11 (0.317)
  90% CI                                 (-1.527 to -0.335)   (-0.729 to 0.515)
  90% CI of difference                                        (-0.298 to 1.645)
  p-value                                                     0.1739

Conclusion and Discussion
From Table 1, the LS mean change in VH score from baseline
to Week 16 is higher in the Treatment A group than in the placebo group.
With Bayesian imputation the difference moves closer to, but does not reach,
statistical significance.
The p-value is smaller with Bayesian imputation
(0.1739) than with the original data (0.3653), whereas LOCF (0.7704) and
BOCF (0.8659) give larger p-values.
The Bayesian approach lends itself naturally to different choices of prior
distributions that encode assumptions about the missing data process.
It offers the possibility of including informative prior information about
the missing data process, but the models can become computationally
challenging.
The procedure can be used in the data preparation steps before calling
the analysis model, to simplify the clinical efficacy data analysis process.

References
Allison, P.D. (2000). Multiple Imputation for Missing Data: A
Cautionary Tale. Sociological Methods and Research, 28: 301-309.
Barnard J, Rubin DB (1999). Small-Sample Degrees of Freedom with
Multiple Imputation. Biometrika, 86: 948-955.
National Research Council. The Prevention and Treatment of Missing
Data in Clinical Trials. The Panel on Handling Missing Data in Clinical
Trials
Rubin DB (1976). Inference and Missing Data. Biometrika, 63: 581-592.
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. John
Wiley & Sons.
Rubin DB (1996). Imputation After 18+ Years. Journal of the American
Statistical Association, 91: 473-489.
Yuan, Yang (2011). Multiple Imputation Using SAS Software. Journal
of Statistical Software, 45(6): 1-25.