library(tidyverse)
library(mice)
library(miceadds)
library(lme4)
library(mitml)
Read in and prepare data.
cor <-read.csv("chamber_corr.csv")
bc_sum <- read.csv("bc_sum.csv")
bc_sum$id_dot <- as.character(bc_sum$id_dot)
bc_sum$id_dot <- if_else(bc_sum$id_dot == "22", "022", bc_sum$id_dot)
bc_sum$dep<- as.character(bc_sum$dep)
cor_clean <- cor %>%
filter(id_dot != 78) %>%
mutate(
id_dot = as.character(id_dot),
id_dot = if_else(id_dot == "22", "022", id_dot),
dep = as.character(dep),
total_light_hours = as.numeric(total_light_hours)) %>%
left_join(
bc_sum %>%
dplyr::select(dep, id_dot, bc_avg, pheo_avg),
by = c("dep", "id_dot")) %>%
rename(
bc = bc_avg,
pheo = pheo_avg) %>%
dplyr::select(dep, month, year, campaign, id_dot, r, gpp, nem,
total_light_hours, mean_temp, mean_day_par, sg_dens, bc)
The md.pattern() function in the mice package displays
missing-data patterns.
md.pattern(cor_clean, rotate.names = TRUE)
## dep month year campaign id_dot r gpp nem total_light_hours mean_temp
## 71 1 1 1 1 1 1 1 1 1 1
## 50 1 1 1 1 1 1 1 1 1 1
## 24 1 1 1 1 1 1 1 1 1 1
## 31 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0
## mean_day_par sg_dens bc
## 71 1 1 1 0
## 50 1 1 0 1
## 24 1 0 1 1
## 31 1 0 0 2
## 3 0 0 1 2
## 3 58 81 142
Above we see that mean_day_par has 3 missing values, sg_dens has 58 missing values, and bc as 81 missing values. The top row of the plot (all blue squares) says we have 71 complete cases. The next row says we have 50 cases with just bc missing. The next rows says we have 24 cases with just sg_dens missing. The last row shows 3 cases missing with mean_day_par and sg_dens missing.
Since mean_day_par has only 3 cases missing we could just impute the mean.
cor_clean$mean_day_par[is.na(cor_clean$mean_day_par)] <- mean(cor_clean$mean_day_par, na.rm = TRUE)
This simplifies the missing data pattern.
md.pattern(cor_clean, rotate.names = TRUE)
## dep month year campaign id_dot r gpp nem total_light_hours mean_temp
## 71 1 1 1 1 1 1 1 1 1 1
## 50 1 1 1 1 1 1 1 1 1 1
## 27 1 1 1 1 1 1 1 1 1 1
## 31 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0
## mean_day_par sg_dens bc
## 71 1 1 1 0
## 50 1 1 0 1
## 27 1 0 1 1
## 31 1 0 0 2
## 0 58 81 139
The mice package in R makes it relatively easy to implement multiple imputation. The author of the package also wrote a book that is free online.
Let’s say we want to fit the following model. Notice in the output that it is fit to only 71 observations.
m <- lmer(gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep),
data = cor_clean)
summary(m, corr = FALSE)
## Linear mixed model fit by REML ['lmerMod']
## Formula: gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep)
## Data: cor_clean
##
## REML criterion at convergence: 545.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1208 -0.4679 -0.0177 0.2913 3.2991
##
## Random effects:
## Groups Name Variance Std.Dev.
## dep (Intercept) 69.76 8.352
## Residual 81.73 9.041
## Number of obs: 71, groups: dep, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -27.55918 21.06094 -1.309
## mean_temp 2.75157 0.52307 5.260
## mean_day_par 0.05664 0.03356 1.688
## bc 0.02259 0.09875 0.229
## sg_dens -0.11182 0.04658 -2.400
To begin multiple imputation we need to declare which variables to
use to impute our missing variables. The
make.predictorMatrix() function creates a predictor matrix
for us. Each variable in the data has a row and a column in the
predictor matrix. A value of 1 indicates that the column variable is
used to impute the row variable.
d <- cor_clean[,c("gpp", "mean_temp", "mean_day_par", "bc", "sg_dens", "dep")]
d$dep <- as.integer(d$dep)
pred <- make.predictorMatrix(d)
pred
## gpp mean_temp mean_day_par bc sg_dens dep
## gpp 0 1 1 1 1 1
## mean_temp 1 0 1 1 1 1
## mean_day_par 1 1 0 1 1 1
## bc 1 1 1 0 1 1
## sg_dens 1 1 1 1 0 1
## dep 1 1 1 1 1 0
Since bc and sg_dens are the only missing variables, we only need to impute them. The other rows can be set to all 0s. An entry of -2 indicates the random effect grouping variable.
pred["gpp",] <- c(0, 0, 0, 0, 0, 0)
pred["mean_temp",] <- c(0, 0, 0, 0, 0, 0)
pred["mean_day_par",] <- c(0, 0, 0, 0, 0, 0)
pred["bc",] <- c(1, 1, 1, 0, 1, -2)
pred["sg_dens",] <- c(1, 1, 1, 1, 0, -2)
pred["dep",] <- c(0, 0, 0, 0, 0, 0)
pred
## gpp mean_temp mean_day_par bc sg_dens dep
## gpp 0 0 0 0 0 0
## mean_temp 0 0 0 0 0 0
## mean_day_par 0 0 0 0 0 0
## bc 1 1 1 0 1 -2
## sg_dens 1 1 1 1 0 -2
## dep 0 0 0 0 0 0
Now we specify which method we want to use to impute the missing variables. The default is pmm, or predictive mean matching.
meth <- make.method(d)
meth
## gpp mean_temp mean_day_par bc sg_dens dep
## "" "" "" "pmm" "pmm" ""
Since we have multilevel data nested within dep, we need to specify new methods. We use “2l.pmm” for bc since it varies within dep. We use “2lonly.pmm” for sg_dens since it is constant within dep.
# need the miceadds package for the 2lonly.pmm method
meth[c("bc", "sg_dens")] <- c("2l.pmm", "2lonly.pmm")
meth
## gpp mean_temp mean_day_par bc sg_dens dep
## "" "" "" "2l.pmm" "2lonly.pmm" ""
Finally we run the imputation using the mice() function.
Below I ran 10 imputations by setting m = 10.
imp <- mice(d, pred = pred, meth = meth, seed = 999,
m = 10, print = FALSE)
Calling plot() on the imputation object produces a
diagnostic trace plots. In general, we want the streams to intermingle
and be free of any trends at the later iterations. These plots look
good.
plot(imp)
Now let’s check the imputations. Imputation 1 is the complete data. 2 - 11 are the imputed data sets. Red dots are the imputed values. This looks good. We didn’t impute anything outside the range of data (though maybe it’s implausible that imputations should take one of the pre-existing 8 unique levels?)
stripplot(imp, sg_dens ~ .imp, pch=20, cex=2)
Check the imputations of bc with density plots. The blue line is the original data.
densityplot(imp, ~ bc)
Now fit our model to all ten imputed data sets and pool the results.
fit <- with(imp, lmer(gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep),
REML = FALSE))
summary(pool(fit))
## term estimate std.error statistic df p.value
## 1 (Intercept) -27.19502504 12.08166317 -2.2509339 24.76714 3.351774e-02
## 2 mean_temp 2.70561123 0.31346334 8.6313480 70.57690 1.181023e-12
## 3 mean_day_par 0.05710578 0.01747786 3.2673214 31.50624 2.623628e-03
## 4 bc 0.08060002 0.10991718 0.7332796 11.87076 4.776198e-01
## 5 sg_dens -0.12263028 0.03820308 -3.2099584 32.96555 2.956411e-03
Let’s round the values to make them easier to read.
sout <- summary(pool(fit))
apply(sout[,-1], 2, round, 3)
## estimate std.error statistic df p.value
## [1,] -27.195 12.082 -2.251 24.767 0.034
## [2,] 2.706 0.313 8.631 70.577 0.000
## [3,] 0.057 0.017 3.267 31.506 0.003
## [4,] 0.081 0.110 0.733 11.871 0.478
## [5,] -0.123 0.038 -3.210 32.966 0.003
Get the estimates of variance.
# this function from the mitml package
testEstimates(as.mitml.result(fit),extra.pars = TRUE)$extra.pars
## Estimate
## Intercept~~Intercept|dep 84.7748861
## Residual~~Residual 77.9030464
## ICC|dep 0.5155117
Compare to model fit to incomplete data.
summary(m, corr = FALSE)
## Linear mixed model fit by REML ['lmerMod']
## Formula: gpp ~ mean_temp + mean_day_par + bc + sg_dens + (1 | dep)
## Data: cor_clean
##
## REML criterion at convergence: 545.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1208 -0.4679 -0.0177 0.2913 3.2991
##
## Random effects:
## Groups Name Variance Std.Dev.
## dep (Intercept) 69.76 8.352
## Residual 81.73 9.041
## Number of obs: 71, groups: dep, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -27.55918 21.06094 -1.309
## mean_temp 2.75157 0.52307 5.260
## mean_day_par 0.05664 0.03356 1.688
## bc 0.02259 0.09875 0.229
## sg_dens -0.11182 0.04658 -2.400
Honestly, not much difference.