--- title: "UsingStackImpute" author: "Created by Dr. Lauren J Beesley. Contact: lbeesley@umich.edu" date: "`r format(Sys.time(), '%d %B, %Y')`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{UsingStackImpute} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In this vignette, we provide a brief introduction to using the R package *StackImpute*. The purpose of this package is to provide resources for performing data analysis in the presence of missing data through analysis of a stacked version of the multiply imputed data. We will not include technical details about this estimation approach and instead focus on implementation. For additional details about the estimation algorithm, we refer the reader to "A stacked approach for chained equations multiple imputation incorporating the substantive model" by Lauren J Beesley and Jeremy M G Taylor in *Biometrics* (2020) and "Accounting for not-at-random missingness through imputation stacking" by Lauren J Beesley and Jeremy M G Taylor currently on *arXiv*. Many researchers have noted that we can do a "good" job estimating regression model parameters by analyzing a stacked version of multiply imputed data. However, estimation of standard errors has proven to be a challenging problem. We provide several estimation strategies and corresponding software for routine estimation as follows: (1) For glms and coxph model fits, the function *Louis_Information* will estimate the information matrix accounting for imputation uncertainty. (2) For more general likelihood-based inference, the function *Louis_Information_Custom* will take user-specified score and covariance matrices from stacked and weighted analysis. For this function, we note that the model-output dispersion parameter for the *glm* function in R is incorrect for stacked multiple imputations and provide a function to obtain a better estimate of the dispersion parameter. (3) The function *Bootstrap_Variance* implements a bootstrap-based method for estimating the covariance matrix as proposed by Dr. Paul Bernhardt in "A Comparison of Stacked and Pooled Multiple Imputation" at the Joint Statistical Meetings in 2019. (4) Finally, the function *Jackknife_Variance* implements a jackknife-based variant of the estimator proposed by Dr. Paul Bernhardt. ## Example analysis using standard imputation stacking First, we suppose that we have imputed missing data using usual multiple imputation incorporating both the outcome and covariates into the imputation. Instead of estimating outcome model parameters using Rubin's combining rules, we propose stacking the multiple imputations to create one long dataset. Then, we perform a weighted analysis, where each subject is weighted by 1/M where M is the number of multiple imputations as follows: ```{r, echo = TRUE, eval = FALSE, fig.width = 7, fig.height= 4} library(dplyr) ### Simulate Data Nobs = 2000 DAT = MASS::mvrnorm(n = Nobs, mu = c(0,0,0), Sigma = rbind(c(1, 0.18, 0.42), c(0.18, 0.09, 0.12),c(0.42, 0.12, 0.49 ))) Y = DAT[,1] B = DAT[,2] X = DAT[,3] S = sample(x=c(0,1), size = Nobs, prob = c(0.5,0.5), replace = TRUE) complete_cases = data.frame(Y, X, B, S)[S == 1,] #complete case subjects only observed_data = data.frame(Y, X, B, S) #data with missingness in B observed_data[S==0,'B'] = NA ### Step 1: Impute B|X,Y imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1) pred = imputes$predictorMatrix pred[pred != 0] = 0 pred["B","X"] = 1 pred["B","Y"] = 1 imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F) ### Step 2: Stack imputed datasets stack = mice::complete(imputes, action="long", include = FALSE) ### Step 3: Obtain weights stack$wt = 1 stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt))) ### Step 4: Point estimation fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt) ### Step 5a: Variance estimation option 1 (for glm and coxph models only) Info = StackImpute::Louis_Information(fit, stack, M = 50) VARIANCE = diag(solve(Info)) ### Step 5b: Variance estimation using custom score and covariance matrices (any model with corresponding likelihood) covariates = as.matrix(cbind(1,stack$X, stack$B)) score = sweep(covariates,1,stack$Y - covariates %*% matrix(coef(fit)), '*')/StackImpute::glm.weighted.dispersion(fit) covariance_weighted = summary(fit)$cov.unscaled*StackImpute::glm.weighted.dispersion(fit) Info = StackImpute::Louis_Information_Custom(score, covariance_weighted, stack, M = 50) VARIANCE_custom = diag(solve(Info)) ### Step 5c: Variance estimation using bootstrap (any model with vcov method) bootcovar = StackImpute::Bootstrap_Variance(fit, stack, M = 50, n_boot = 100) VARIANCE_boot = diag(bootcovar) ### Step 5d: Variance estimation using jackknife (any model with vcov method) jackcovar = StackImpute::Jackknife_Variance(fit, stack, M = 50) VARIANCE_jack = diag(jackcovar) ``` ## Example analysis using modified imputation stacking allowing for compatibility of the imputation and analysis models In Beesley and Taylor (2020) in *Biometrics*, we propose a modification to this data analysis pipeline that involves imputing missing covariates from distributions that do not condition on the outcome of interest and then performing a weighted analysis on the stacked data. In this case, weights are defined proportional to the distribution of the outcome given covariates in the complete case data. We can implement this modified analysis pipeline as follows: ```{r, echo = TRUE, eval = FALSE, fig.width = 7, fig.height= 4} ### Step 1: Impute B|X imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1) pred = imputes$predictorMatrix pred[pred != 0] = 0 pred["B","X"] = 1 imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F) ### Step 2: Stack imputed datasets stack = mice::complete(imputes, action="long", include = FALSE) ### Step 3: Obtain weights fit_cc = glm(Y ~ X + B, family='gaussian', data= complete_cases) stack$wt = dnorm(stack$Y,mean = predict(fit_cc, newdata = stack), sd = sqrt(summary(fit_cc)$dispersion)) stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt))) ### Step 4: Point estimation fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt) ### Any one of the above variance estimation strategies can then be applied. ``` ## A note on stack structure We can equivalently performed our proposed data analysis using "short" stacked dataset in which subjects with fully-observed data only appear once. The advantage of this approach is that estimation may be faster and the stacked dataset may be much smaller. We can perform the above analysis using the shorter stack obtained below. ```{r, echo = TRUE, eval = FALSE, fig.width = 7, fig.height= 4} ### Step 2: Stack imputed datasets stack = mice::complete(imputes, action="long", include = FALSE) cc = unique(stack$.id[stack$S == 1]) stack_short = rbind(stack[stack$S==0,], stack[stack$S==1 & !duplicated(stack$.id),]) ``` ## Example analysis addressing not-at-random missingness through stacked and weighted analysis In Beesley and Taylor (2021) on *arXiv*, we propose another modification to this data analysis pipeline that accounts for not-at-random missingness through weighted analysis of stacked multiple imputations. Weights are a simple function of the imputed data and assumptions about the missingness mechanism. In the special case where missingness follows a logistic regression, the weights take a very simple form as shown. ```{r, echo = TRUE, eval = FALSE, fig.width = 7, fig.height= 4} ### Simulate Data prob_obs = exp(2*B + 1*Y)/(1+exp(2*B + 1*Y)) S_mnar = as.numeric(prob_obs > runif(Nobs,0,1)) complete_cases = data.frame(Y, X, B, S=S_mnar)[S_mnar == 1,] #complete case subjects only observed_data_mnar = data.frame(Y, X, B, S=S_mnar) #data with missingness in B observed_data_mnar[S_mnar==0,'B'] = NA ### Step 1: Impute B|X,Y imputes_mnar = mice::mice(observed_data_mnar, m=50, method="norm", printFlag=F, maxit = 1) pred = imputes_mnar$predictorMatrix pred[pred != 0] = 0 pred["B","X"] = 1 pred["B","Y"] = 1 imputes_mnar = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="norm", printFlag=F) ### Step 2: Stack imputed datasets stack = mice::complete(imputes_mnar, action="long", include = FALSE) ### Step 3: Obtain weights phi1_assumed = 2 stack$wt = exp(-phi1_assumed*stack$B) stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt))) ### Step 4: Point estimation fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt) ### Any one of the above variance estimation strategies can then be applied. ``` ## Example analysis addressing not-at-random missingness using the method of Tompsett et al. (2018) In Tompsett et al. (2018) in *Statistics in Medicine*, researchers explore an alternative strategy for addressing not-at-random missingness. These researchers handle MNAR missingness by positing a pattern mixture model structure for the imputation models, where the adjusted association between a covariate's value and its own missingness is treated as a fixed sensitivity analysis-type parameter. Here, we demonstrate how we can apply this method to address the MNAR missingness. ```{r, echo = TRUE, eval = FALSE, fig.width = 7, fig.height= 4} ### Imputation Function (modified version of mice::mice.impute.mnar.norm()) mice.impute.mnar.norm2 = function (y, ry, x, wy = NULL, ums = NULL, umx = NULL, ...){ u <- mice:::parse.ums(x, ums = ums, umx = umx, ...) if (is.null(wy)) wy <- !ry x <- cbind(1, as.matrix(x)) parm <- mice:::.norm.draw(y, ry, x, ...) return(x[wy, ] %*% parm$beta + as.matrix(u$x[wy, ]) %*% as.matrix(u$delta) + rnorm(sum(wy)) *parm$sigma) } ### *Ideal* pattern mixture model offset parameter for these simulated data: delta1_assumed = -0.087 ### Step 1: Impute B|X,Y,S mnar.blot <- list(B = list(ums =paste0('-',abs(delta1_assumed)))) imputes_pmm = mice::mice(observed_data_mnar, m=50, method="mnar.norm2", printFlag=F, maxit = 1, blots = mnar.blot) pred = imputes_pmm$predictorMatrix pred[pred != 0] = 0 pred["B","X"] = 1 pred["B","Y"] = 1 imputes_pmm = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="mnar.norm2", printFlag=F, blots = mnar.blot) ### Step 2: Apply Rubin's Rules to obtain point estimates and standard errors fit = summary(mice::pool(with(imputes_pmm,glm(Y ~ X + B, family=gaussian())))) param = fit$estimate VARIANCE = (fit$std.error)^2 ```