---
title: "UsingStackImpute"
author: "Created by Dr. Lauren J Beesley. Contact: lbeesley@umich.edu"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{UsingStackImpute}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


In this vignette, we provide a brief introduction to using the R package *StackImpute*. The purpose of this package is to provide resources for performing data analysis in the presence of missing data through analysis of a stacked version of the multiply imputed data. We will not include technical details about this estimation approach and instead focus on implementation. For additional details about the estimation algorithm, we refer the reader to "A stacked approach for chained equations multiple imputation incorporating the substantive model" by Lauren J Beesley and Jeremy M G Taylor in *Biometrics* (2020) and "Accounting for not-at-random missingness through imputation stacking" by Lauren J Beesley and Jeremy M G Taylor currently on *arXiv*. 

Many researchers have noted that we can do a "good" job estimating regression model parameters by analyzing a stacked version of multiply imputed data. However, estimation of standard errors has proven to be a challenging problem. We provide several estimation strategies and corresponding software for routine estimation as follows:

(1) For glms and coxph model fits, the function *Louis_Information* will estimate the information matrix accounting for imputation uncertainty. 

(2) For more general likelihood-based inference, the function *Louis_Information_Custom* will take user-specified score and covariance matrices from stacked and weighted analysis. For this function, we note that the model-output dispersion parameter for the *glm* function in R is incorrect for stacked multiple imputations and provide a  function to obtain a better estimate of the dispersion parameter.  

(3) The function *Bootstrap_Variance* implements a bootstrap-based method for estimating the covariance matrix as proposed by Dr. Paul Bernhardt in "A Comparison of Stacked and Pooled Multiple Imputation" at the Joint Statistical Meetings in 2019. 

(4) Finally, the function *Jackknife_Variance* implements a jackknife-based variant of the estimator proposed by Dr. Paul Bernhardt.

## Example analysis using standard imputation stacking 
First, we suppose that we have imputed missing data using usual multiple imputation incorporating both the outcome and covariates into the imputation. Instead of estimating outcome model parameters using Rubin's combining rules, we propose stacking the multiple imputations to create one long dataset. Then, we perform a weighted analysis, where each subject is weighted by 1/M where M is the number of multiple imputations as follows:

```{r, echo = TRUE, eval = FALSE,  fig.width = 7, fig.height= 4}
library(dplyr)
### Simulate Data
Nobs = 2000
DAT = MASS::mvrnorm(n = Nobs, mu = c(0,0,0), Sigma = rbind(c(1, 0.18, 0.42), c(0.18, 0.09, 0.12),c(0.42, 0.12, 0.49 )))
Y = DAT[,1]
B = DAT[,2]
X = DAT[,3]
S = sample(x=c(0,1), size = Nobs, prob = c(0.5,0.5), replace = TRUE)
complete_cases = data.frame(Y, X, B, S)[S == 1,] #complete case subjects only
observed_data = data.frame(Y, X, B, S) #data with missingness in B
observed_data[S==0,'B'] = NA

### Step 1: Impute B|X,Y
imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes$predictorMatrix 
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F)

### Step 2: Stack imputed datasets	
stack = mice::complete(imputes, action="long", include = FALSE)

### Step 3: Obtain weights
stack$wt = 1
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))

### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)

### Step 5a: Variance estimation option 1 (for glm and coxph models only)
Info = StackImpute::Louis_Information(fit, stack, M = 50)
VARIANCE = diag(solve(Info))

### Step 5b: Variance estimation using custom score and covariance matrices (any model with corresponding likelihood)
covariates = as.matrix(cbind(1,stack$X, stack$B))
score = sweep(covariates,1,stack$Y - covariates %*% matrix(coef(fit)), '*')/StackImpute::glm.weighted.dispersion(fit)
covariance_weighted = summary(fit)$cov.unscaled*StackImpute::glm.weighted.dispersion(fit) 
Info = StackImpute::Louis_Information_Custom(score, covariance_weighted, stack, M = 50)
VARIANCE_custom = diag(solve(Info))

### Step 5c: Variance estimation using bootstrap (any model with vcov method)
bootcovar = StackImpute::Bootstrap_Variance(fit, stack, M = 50, n_boot = 100)
VARIANCE_boot = diag(bootcovar)

### Step 5d: Variance estimation using jackknife (any model with vcov method)
jackcovar = StackImpute::Jackknife_Variance(fit, stack, M = 50)
VARIANCE_jack = diag(jackcovar)

```


## Example analysis using modified imputation stacking allowing for compatibility of the imputation and analysis models

In Beesley and Taylor (2020) in *Biometrics*, we propose a modification to this data analysis pipeline that involves imputing missing covariates from distributions that do not condition on the outcome of interest and then performing a weighted analysis on the stacked data. In this case, weights are defined proportional to the distribution of the outcome given covariates in the complete case data. We can implement this modified analysis pipeline as follows: 

```{r, echo = TRUE, eval = FALSE,  fig.width = 7, fig.height= 4}
### Step 1: Impute B|X
imputes = mice::mice(observed_data, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes$predictorMatrix 
pred[pred != 0] = 0
pred["B","X"] = 1
imputes = mice::mice(observed_data, m=50, predictorMatrix=pred, method="norm", printFlag=F)

### Step 2: Stack imputed datasets	
stack = mice::complete(imputes, action="long", include = FALSE)

### Step 3: Obtain weights
fit_cc = glm(Y ~ X + B, family='gaussian', data= complete_cases)
stack$wt = dnorm(stack$Y,mean = predict(fit_cc, newdata = stack), sd = sqrt(summary(fit_cc)$dispersion))
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))

### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)

### Any one of the above variance estimation strategies can then be applied.
```


## A note on stack structure

We can equivalently performed our proposed data analysis using "short" stacked dataset in which subjects with fully-observed data only appear once. The advantage of this approach is that estimation may be faster and the stacked dataset may be much smaller. We can perform the above analysis using the shorter stack obtained below. 
```{r, echo = TRUE, eval = FALSE,  fig.width = 7, fig.height= 4}
### Step 2: Stack imputed datasets	
stack = mice::complete(imputes, action="long", include = FALSE)
cc = unique(stack$.id[stack$S == 1])
stack_short = rbind(stack[stack$S==0,], stack[stack$S==1 & !duplicated(stack$.id),])
```


## Example analysis addressing not-at-random missingness through stacked and weighted analysis

In Beesley and Taylor (2021) on *arXiv*, we propose another modification to this data analysis pipeline that accounts for not-at-random missingness through weighted analysis of stacked multiple imputations. Weights are a simple function of the imputed data and assumptions about the missingness mechanism. In the special case where missingness follows a logistic regression, the weights take a very simple form as shown.

```{r, echo = TRUE, eval = FALSE,  fig.width = 7, fig.height= 4}
### Simulate Data
prob_obs = exp(2*B + 1*Y)/(1+exp(2*B + 1*Y))
S_mnar = as.numeric(prob_obs > runif(Nobs,0,1))
complete_cases = data.frame(Y, X, B, S=S_mnar)[S_mnar == 1,] #complete case subjects only
observed_data_mnar = data.frame(Y, X, B, S=S_mnar) #data with missingness in B
observed_data_mnar[S_mnar==0,'B'] = NA

### Step 1: Impute B|X,Y
imputes_mnar = mice::mice(observed_data_mnar, m=50, method="norm", printFlag=F, maxit = 1)
pred = imputes_mnar$predictorMatrix 
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes_mnar = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="norm", printFlag=F)

### Step 2: Stack imputed datasets
stack = mice::complete(imputes_mnar, action="long", include = FALSE)

### Step 3: Obtain weights
phi1_assumed = 2
stack$wt = exp(-phi1_assumed*stack$B)
stack = as.data.frame(stack %>% group_by(.id) %>% mutate(wt = wt / sum(wt)))

### Step 4: Point estimation
fit = glm(Y ~X + B, data=stack, family=gaussian(), weights = stack$wt)

### Any one of the above variance estimation strategies can then be applied.
```


## Example analysis addressing not-at-random missingness using the method of Tompsett et al. (2018)

In Tompsett et al. (2018) in *Statistics in Medicine*, researchers explore an alternative strategy for addressing not-at-random missingness. These researchers handle MNAR missingness by positing a pattern mixture model structure for the imputation models, where the adjusted association between a covariate's value and its own missingness is treated as a fixed sensitivity analysis-type parameter. Here, we demonstrate how we can apply this method to address the MNAR missingness.

<!-- results = c() -->
<!-- resultsTomp = c() -->
<!-- theta = c() -->
<!-- for(i in c(1:100)){ -->
<!-- S_mnar = as.numeric(prob_obs > runif(Nobs,0,1)) -->
<!-- fit = glm(S_mnar~X+B+Y, family = 'binomial') -->
<!-- fitTomp = glm(B~X+Y+as.numeric(1-S_mnar), family = 'gaussian') -->
<!-- fitTheta = glm(Y~X+B, family = 'gaussian') -->
<!-- results = rbind(results, as.numeric(coef(fit))) -->
<!-- resultsTomp = rbind(resultsTomp, as.numeric(coef(fitTomp))) -->
<!-- theta = rbind(theta, as.numeric(coef(fitTheta))) -->

<!-- } -->
<!-- apply(results[,2:4],2,mean) -->
<!-- apply(resultsTomp[,2:4],2,mean) -->
<!-- apply(theta[,2:3],2,mean) -->


<!-- plot(B[S_mnar == 1]) -->
<!-- points(B[S_mnar == 0], col = 'red') -->

```{r, echo = TRUE, eval = FALSE,  fig.width = 7, fig.height= 4}
### Imputation Function (modified version of mice::mice.impute.mnar.norm())
mice.impute.mnar.norm2 = function (y, ry, x, wy = NULL, ums = NULL, umx = NULL, ...){
  u <- mice:::parse.ums(x, ums = ums, umx = umx, ...)
  if (is.null(wy))
    wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- mice:::.norm.draw(y, ry, x, ...)
  return(x[wy, ] %*% parm$beta + as.matrix(u$x[wy, ]) %*% as.matrix(u$delta) + rnorm(sum(wy)) *parm$sigma)
}

### *Ideal* pattern mixture model offset parameter for these simulated data:
delta1_assumed = -0.087

### Step 1: Impute B|X,Y,S
mnar.blot <- list(B = list(ums =paste0('-',abs(delta1_assumed)))) 
imputes_pmm = mice::mice(observed_data_mnar, m=50, method="mnar.norm2", printFlag=F, maxit = 1, blots = mnar.blot)
pred = imputes_pmm$predictorMatrix 
pred[pred != 0] = 0
pred["B","X"] = 1
pred["B","Y"] = 1
imputes_pmm = mice::mice(observed_data_mnar, m=50, predictorMatrix=pred, method="mnar.norm2", printFlag=F, blots = mnar.blot)

### Step 2: Apply Rubin's Rules to obtain point estimates and standard errors
fit = summary(mice::pool(with(imputes_pmm,glm(Y ~ X + B, family=gaussian()))))
param = fit$estimate
VARIANCE = (fit$std.error)^2
```