In this vignette, we provide a brief introduction to using the R package StackImpute. The purpose of this package is to provide resources for performing data analysis in the presence of missing data through analysis of a stacked version of the multiply imputed data. We will not include technical details about this estimation approach and instead focus on implementation. For additional details about the estimation algorithm, we refer the reader to “A stacked approach for chained equations multiple imputation incorporating the substantive model” by Lauren J Beesley and Jeremy M G Taylor in Biometrics (2020) and “Accounting for not-at-random missingness through imputation stacking” by Lauren J Beesley and Jeremy M G Taylor currently on arXiv.
Many researchers have noted that we can do a “good” job estimating regression model parameters by analyzing a stacked version of multiply imputed data. However, estimation of standard errors has proven to be a challenging problem. We provide several estimation strategies and corresponding software for routine estimation as follows:
For glms and coxph model fits, the function Louis_Information will estimate the information matrix accounting for imputation uncertainty.
For more general likelihood-based inference, the function Louis_Information_Custom will take user-specified score and covariance matrices from stacked and weighted analysis. For this function, we note that the model-output dispersion parameter for the glm function in R is incorrect for stacked multiple imputations and provide a function to obtain a better estimate of the dispersion parameter.
The function Bootstrap_Variance implements a bootstrap-based method for estimating the covariance matrix as proposed by Dr. Paul Bernhardt in “A Comparison of Stacked and Pooled Multiple Imputation” at the Joint Statistical Meetings in 2019.
Finally, the function Jackknife_Variance implements a jackknife-based variant of the estimator proposed by Dr. Paul Bernhardt.
First, we suppose that we have imputed missing data using usual multiple imputation incorporating both the outcome and covariates into the imputation. Instead of estimating outcome model parameters using Rubin’s combining rules, we propose stacking the multiple imputations to create one long dataset. Then, we perform a weighted analysis, where each subject is weighted by 1/M where M is the number of multiple imputations as follows:
library(dplyr)
### Simulate Data
Nobs = 2000
DAT = MASS::mvrnorm(n = Nobs, mu = c(0,0,0), Sigma = rbind(c(1, 0.18, 0.42), c(0.18, 0.09, 0.12),c(0.42, 0.12, 0.49 )))
Y = DAT[,1]
B = DAT[,2]
X = DAT[,3]
S = sample(x=c(0,1), size = Nobs, prob = c(0.5,0.5), replace = TRUE)
complete_cases = data.frame(Y, X, B, S)[S == 1,] #complete case subjects only
observed_data = data.frame(Y, X, B, S) #data with missingness in B
observed_data[S==0,'B'] = NA
### Step 1: Impute B|X,Y
imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes$predictorMatrix
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F)
### Step 2: Stack imputed datasets
stack = mice::complete(imputes, action="long", include = FALSE)
### Step 3: Obtain weights
stack$wt = 1
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))
### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)
### Step 5a: Variance estimation option 1 (for glm and coxph models only)
Info = StackImpute::Louis_Information(fit, stack, M = 50)
VARIANCE = diag(solve(Info))
### Step 5b: Variance estimation using custom score and covariance matrices (any model with corresponding likelihood)
covariates = as.matrix(cbind(1,stack$X, stack$B))
score = sweep(covariates,1,stack$Y - covariates %*% matrix(coef(fit)), '*')/StackImpute::glm.weighted.dispersion(fit)
covariance_weighted = summary(fit)$cov.unscaled*StackImpute::glm.weighted.dispersion(fit)
Info = StackImpute::Louis_Information_Custom(score, covariance_weighted, stack, M = 50)
VARIANCE_custom = diag(solve(Info))
### Step 5c: Variance estimation using bootstrap (any model with vcov method)
bootcovar = StackImpute::Bootstrap_Variance(fit, stack, M = 50, n_boot = 100)
VARIANCE_boot = diag(bootcovar)
### Step 5d: Variance estimation using jackknife (any model with vcov method)
jackcovar = StackImpute::Jackknife_Variance(fit, stack, M = 50)
VARIANCE_jack = diag(jackcovar)
In Beesley and Taylor (2020) in Biometrics, we propose a modification to this data analysis pipeline that involves imputing missing covariates from distributions that do not condition on the outcome of interest and then performing a weighted analysis on the stacked data. In this case, weights are defined proportional to the distribution of the outcome given covariates in the complete case data. We can implement this modified analysis pipeline as follows:
### Step 1: Impute B|X
imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes$predictorMatrix
pred[pred != 0] = 0
pred["B","X"] = 1
imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F)
### Step 2: Stack imputed datasets
stack = mice::complete(imputes, action="long", include = FALSE)
### Step 3: Obtain weights
fit_cc = glm(Y ~ X + B, family='gaussian', data= complete_cases)
stack$wt = dnorm(stack$Y,mean = predict(fit_cc, newdata = stack), sd = sqrt(summary(fit_cc)$dispersion))
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))
### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)
### Any one of the above variance estimation strategies can then be applied.
We can equivalently performed our proposed data analysis using “short” stacked dataset in which subjects with fully-observed data only appear once. The advantage of this approach is that estimation may be faster and the stacked dataset may be much smaller. We can perform the above analysis using the shorter stack obtained below.
In Beesley and Taylor (2021) on arXiv, we propose another modification to this data analysis pipeline that accounts for not-at-random missingness through weighted analysis of stacked multiple imputations. Weights are a simple function of the imputed data and assumptions about the missingness mechanism. In the special case where missingness follows a logistic regression, the weights take a very simple form as shown.
### Simulate Data
prob_obs = exp(2*B + 1*Y)/(1+exp(2*B + 1*Y))
S_mnar = as.numeric(prob_obs > runif(Nobs,0,1))
complete_cases = data.frame(Y, X, B, S=S_mnar)[S_mnar == 1,] #complete case subjects only
observed_data_mnar = data.frame(Y, X, B, S=S_mnar) #data with missingness in B
observed_data_mnar[S_mnar==0,'B'] = NA
### Step 1: Impute B|X,Y
imputes_mnar = mice::mice(observed_data_mnar, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes_mnar$predictorMatrix
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes_mnar = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="norm", printFlag=F)
### Step 2: Stack imputed datasets
stack = mice::complete(imputes_mnar, action="long", include = FALSE)
### Step 3: Obtain weights
phi1_assumed = 2
stack$wt = exp(-phi1_assumed*stack$B)
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))
### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)
### Any one of the above variance estimation strategies can then be applied.
In Tompsett et al. (2018) in Statistics in Medicine, researchers explore an alternative strategy for addressing not-at-random missingness. These researchers handle MNAR missingness by positing a pattern mixture model structure for the imputation models, where the adjusted association between a covariate’s value and its own missingness is treated as a fixed sensitivity analysis-type parameter. Here, we demonstrate how we can apply this method to address the MNAR missingness.
### Imputation Function (modified version of mice::mice.impute.mnar.norm())
mice.impute.mnar.norm2 = function (y, ry, x, wy = NULL, ums = NULL, umx = NULL, ...){
u <- mice:::parse.ums(x, ums = ums, umx = umx, ...)
if (is.null(wy))
wy <- !ry
x <- cbind(1, as.matrix(x))
parm <- mice:::.norm.draw(y, ry, x, ...)
return(x[wy, ] %*% parm$beta + as.matrix(u$x[wy, ]) %*% as.matrix(u$delta) + rnorm(sum(wy)) *parm$sigma)
}
### *Ideal* pattern mixture model offset parameter for these simulated data:
delta1_assumed = -0.087
### Step 1: Impute B|X,Y,S
mnar.blot <- list(B = list(ums =paste0('-',abs(delta1_assumed))))
imputes_pmm = mice::mice(observed_data_mnar, m=50, method="mnar.norm2", printFlag=F, maxit = 1, blots = mnar.blot)
pred = imputes_pmm$predictorMatrix
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes_pmm = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="mnar.norm2", printFlag=F, blots = mnar.blot)
### Step 2: Apply Rubin's Rules to obtain point estimates and standard errors
fit = summary(mice::pool(with(imputes_pmm,glm(Y ~ X + B, family=gaussian()))))
param = fit$estimate
VARIANCE = (fit$std.error)^2