Example of Multiple Imputation

library(tidyverse)
library(mice)
library(miceadds)
library(lme4)
library(mitml)

Read in and prepare data.

cor <-read.csv("chamber_corr.csv")
bc_sum <- read.csv("bc_sum.csv")
bc_sum$id_dot <- as.character(bc_sum$id_dot)
bc_sum$id_dot <- if_else(bc_sum$id_dot == "22", "022", bc_sum$id_dot)
bc_sum$dep<- as.character(bc_sum$dep)

cor_clean <- cor %>%
  filter(id_dot != 78) %>%
  mutate(
    id_dot = as.character(id_dot),
    id_dot = if_else(id_dot == "22", "022", id_dot),
    dep = as.character(dep),
    total_light_hours = as.numeric(total_light_hours)) %>%
  left_join(
    bc_sum %>%
      dplyr::select(dep, id_dot, bc_avg, pheo_avg),
    by = c("dep", "id_dot")) %>%
  rename(
    bc = bc_avg,
    pheo = pheo_avg) %>% 
  dplyr::select(dep, month, year, campaign, id_dot, r, gpp, nem, 
                total_light_hours, mean_temp, mean_day_par, sg_dens, bc)

Investigate missingness

The md.pattern() function in the mice package displays missing-data patterns.

md.pattern(cor_clean, rotate.names = TRUE)

##    dep month year campaign id_dot r gpp nem total_light_hours mean_temp
## 71   1     1    1        1      1 1   1   1                 1         1
## 50   1     1    1        1      1 1   1   1                 1         1
## 24   1     1    1        1      1 1   1   1                 1         1
## 31   1     1    1        1      1 1   1   1                 1         1
## 3    1     1    1        1      1 1   1   1                 1         1
##      0     0    0        0      0 0   0   0                 0         0
##    mean_day_par sg_dens bc    
## 71            1       1  1   0
## 50            1       1  0   1
## 24            1       0  1   1
## 31            1       0  0   2
## 3             0       0  1   2
##               3      58 81 142

Above we see that mean_day_par has 3 missing values, sg_dens has 58 missing values, and bc as 81 missing values. The top row of the plot (all blue squares) says we have 71 complete cases. The next row says we have 50 cases with just bc missing. The next rows says we have 24 cases with just sg_dens missing. The last row shows 3 cases missing with mean_day_par and sg_dens missing.

Since mean_day_par has only 3 cases missing we could just impute the mean.

cor_clean$mean_day_par[is.na(cor_clean$mean_day_par)] <- mean(cor_clean$mean_day_par, na.rm = TRUE)

This simplifies the missing data pattern.

md.pattern(cor_clean, rotate.names = TRUE)

##    dep month year campaign id_dot r gpp nem total_light_hours mean_temp
## 71   1     1    1        1      1 1   1   1                 1         1
## 50   1     1    1        1      1 1   1   1                 1         1
## 27   1     1    1        1      1 1   1   1                 1         1
## 31   1     1    1        1      1 1   1   1                 1         1
##      0     0    0        0      0 0   0   0                 0         0
##    mean_day_par sg_dens bc    
## 71            1       1  1   0
## 50            1       1  0   1
## 27            1       0  1   1
## 31            1       0  0   2
##               0      58 81 139

Multiple imputation

The mice package in R makes it relatively easy to implement multiple imputation. The author of the package also wrote a book that is free online.

Let’s say we want to fit the following model. Notice in the output that it is fit to only 71 observations.

m <- lmer(gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep),
           data = cor_clean)
summary(m, corr = FALSE)

## Linear mixed model fit by REML ['lmerMod']
## Formula: gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep)
##    Data: cor_clean
## 
## REML criterion at convergence: 545.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1208 -0.4679 -0.0177  0.2913  3.2991 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  dep      (Intercept) 69.76    8.352   
##  Residual             81.73    9.041   
## Number of obs: 71, groups:  dep, 20
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  -27.55918   21.06094  -1.309
## mean_temp      2.75157    0.52307   5.260
## mean_day_par   0.05664    0.03356   1.688
## bc             0.02259    0.09875   0.229
## sg_dens       -0.11182    0.04658  -2.400

To begin multiple imputation we need to declare which variables to use to impute our missing variables. The make.predictorMatrix() function creates a predictor matrix for us. Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable is used to impute the row variable.

d <- cor_clean[,c("gpp", "mean_temp", "mean_day_par", "bc", "sg_dens", "dep")]
d$dep <- as.integer(d$dep)
pred <- make.predictorMatrix(d)
pred

##              gpp mean_temp mean_day_par bc sg_dens dep
## gpp            0         1            1  1       1   1
## mean_temp      1         0            1  1       1   1
## mean_day_par   1         1            0  1       1   1
## bc             1         1            1  0       1   1
## sg_dens        1         1            1  1       0   1
## dep            1         1            1  1       1   0

Since bc and sg_dens are the only missing variables, we only need to impute them. The other rows can be set to all 0s. An entry of -2 indicates the random effect grouping variable.

pred["gpp",]          <- c(0, 0, 0, 0, 0, 0)
pred["mean_temp",]    <- c(0, 0, 0, 0, 0, 0)
pred["mean_day_par",] <- c(0, 0, 0, 0, 0, 0)
pred["bc",]           <- c(1, 1, 1, 0, 1, -2)
pred["sg_dens",]      <- c(1, 1, 1, 1, 0, -2)
pred["dep",]          <- c(0, 0, 0, 0, 0, 0)
pred

##              gpp mean_temp mean_day_par bc sg_dens dep
## gpp            0         0            0  0       0   0
## mean_temp      0         0            0  0       0   0
## mean_day_par   0         0            0  0       0   0
## bc             1         1            1  0       1  -2
## sg_dens        1         1            1  1       0  -2
## dep            0         0            0  0       0   0

Now we specify which method we want to use to impute the missing variables. The default is pmm, or predictive mean matching.

meth <- make.method(d)
meth

##          gpp    mean_temp mean_day_par           bc      sg_dens          dep 
##           ""           ""           ""        "pmm"        "pmm"           ""

Since we have multilevel data nested within dep, we need to specify new methods. We use “2l.pmm” for bc since it varies within dep. We use “2lonly.pmm” for sg_dens since it is constant within dep.

# need the miceadds package for the 2lonly.pmm method
meth[c("bc", "sg_dens")] <- c("2l.pmm", "2lonly.pmm")
meth

##          gpp    mean_temp mean_day_par           bc      sg_dens          dep 
##           ""           ""           ""     "2l.pmm" "2lonly.pmm"           ""

Finally we run the imputation using the mice() function. Below I ran 10 imputations by setting m = 10.

imp <- mice(d, pred = pred, meth = meth, seed = 999,
            m = 10, print = FALSE)

Calling plot() on the imputation object produces a diagnostic trace plots. In general, we want the streams to intermingle and be free of any trends at the later iterations. These plots look good.

plot(imp)

Now let’s check the imputations. Imputation 1 is the complete data. 2 - 11 are the imputed data sets. Red dots are the imputed values. This looks good. We didn’t impute anything outside the range of data (though maybe it’s implausible that imputations should take one of the pre-existing 8 unique levels?)

stripplot(imp, sg_dens ~ .imp, pch=20, cex=2)

Check the imputations of bc with density plots. The blue line is the original data.

densityplot(imp, ~ bc)

Now fit our model to all ten imputed data sets and pool the results.

fit <- with(imp, lmer(gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep), 
                      REML = FALSE))
summary(pool(fit))

##           term     estimate   std.error  statistic       df      p.value
## 1  (Intercept) -27.19502504 12.08166317 -2.2509339 24.76714 3.351774e-02
## 2    mean_temp   2.70561123  0.31346334  8.6313480 70.57690 1.181023e-12
## 3 mean_day_par   0.05710578  0.01747786  3.2673214 31.50624 2.623628e-03
## 4           bc   0.08060002  0.10991718  0.7332796 11.87076 4.776198e-01
## 5      sg_dens  -0.12263028  0.03820308 -3.2099584 32.96555 2.956411e-03

Let’s round the values to make them easier to read.

sout <- summary(pool(fit))
apply(sout[,-1], 2, round, 3)

##      estimate std.error statistic     df p.value
## [1,]  -27.195    12.082    -2.251 24.767   0.034
## [2,]    2.706     0.313     8.631 70.577   0.000
## [3,]    0.057     0.017     3.267 31.506   0.003
## [4,]    0.081     0.110     0.733 11.871   0.478
## [5,]   -0.123     0.038    -3.210 32.966   0.003

Get the estimates of variance.

# this function from the mitml package
testEstimates(as.mitml.result(fit),extra.pars = TRUE)$extra.pars

##                            Estimate
## Intercept~~Intercept|dep 84.7748861
## Residual~~Residual       77.9030464
## ICC|dep                   0.5155117

Compare to model fit to incomplete data.

summary(m, corr = FALSE)

## Linear mixed model fit by REML ['lmerMod']
## Formula: gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep)
##    Data: cor_clean
## 
## REML criterion at convergence: 545.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1208 -0.4679 -0.0177  0.2913  3.2991 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  dep      (Intercept) 69.76    8.352   
##  Residual             81.73    9.041   
## Number of obs: 71, groups:  dep, 20
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  -27.55918   21.06094  -1.309
## mean_temp      2.75157    0.52307   5.260
## mean_day_par   0.05664    0.03356   1.688
## bc             0.02259    0.09875   0.229
## sg_dens       -0.11182    0.04658  -2.400

Honestly, not much difference.

Example of Multiple Imputation

Clay Ford

2026-05-14

Investigate missingness

Multiple imputation