Student is studying variation in forest canopy depth across the eastern United States. She is examining how canopy depth changes with latitude and to identify “key drivers” at both the among-site and within-site scales. The study includes 11 forest sites spanning a broad range of latitudes and five potential predictors.
In the document sent before the meeting, the following statement is made (emphasis mine):
To further address collinearity and enable causal interpretation, we separated shared and independent effects prior to structural modelling. Forest age and canopy height were each regressed against solar angle, and then their residuals—representing variation independent of latitude—were used in a structural equation model (SEM) to quantify direct and indirect pathways on canopy depth.
This method was unfamiliar to me so I did some investigating.
This idea of using residuals as predictors apparently comes from the Frisch-Waugh-Lovell Theorem. This page provides an accessible overview along with a demo in R. The main attraction is that it can create orthogonal predictors with no correlation. But the new predictors (the residuals) are estimates themselves, so additional error may propagate through the model. This is also known as the Partialling-out Procedure. (See graphic at the top of this page.)
Here is the SEM model the student fit.
In words, “age affects height, which affects depth. Solar angle also affects depth.” In this model she is using residuals for the age and height predictors. In other words, she fit the following models and extracted the residuals. Something like this, I think:
age_resid <- lm(age ~ solar_angle)$residuals
height_resid <- lm(height ~ solar_angle)$residuals
In the SEM, age_resid and height_resid are
being used as the predictors. Presumably this is desirable because the
raw age and height values are highly correlated with solar angle. The
derived residual values are orthogonal to solar angle. However, since
solar angle is also in the model, the estimated coefficients
should be the same. So the step of deriving the residuals and
using them in the model should be redundant.
Simulate correlated data according to the SEM path model above. The “true” coefficients are as follows:
set.seed(4102026)
n <- 100
solar_angle <- rnorm(n)
age <- 0.8*solar_angle + rnorm(n, sd = 0.2) # induce correlation with solar_angle
height <- 0.5*age + rnorm(n, sd = 0.2)
depth <- -0.7*solar_angle + 0.5*height + rnorm(n, sd = 0.2)
d <- data.frame(depth, solar_angle, height, age)
Height and age are highly correlated with solar angle:
cor(d[,c("solar_angle", "height", "age")])
## solar_angle height age
## solar_angle 1.0000000 0.8958601 0.9713247
## height 0.8958601 1.0000000 0.9281924
## age 0.9713247 0.9281924 1.0000000
Since the predictors are highly correlated, derive their residuals by regressing each on solar angle.
# residualization
d$height_r <- lm(height ~ solar_angle, data = d)$residuals
d$age_r <- lm(age ~ solar_angle, data = d)$residuals
Now their correlation with solar angle are basically 0 (though the residuals themselves are still somewhat correlated). This seems like a good thing!
round(cor(d[,c("solar_angle", "height_r", "age_r")]), 4)
## solar_angle height_r age_r
## solar_angle 1 0.0000 0.0000
## height_r 0 1.0000 0.5492
## age_r 0 0.5492 1.0000
Now fit an SEM in attempt to recover true values used to generate data.
library(piecewiseSEM)
fit1 <- psem(
lm(height_r ~ age_r, data = d),
lm(depth ~ solar_angle + height_r, data = d)
)
sout1 <- summary(fit1, .progressBar = FALSE)
sout1$coefficients
## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 height_r age_r 0.5396 0.0829 98 6.5060 0 0.5492
## 2 depth solar_angle -0.4938 0.0170 97 -29.0262 0 -0.9341
## 3 depth height_r 0.4370 0.0857 97 5.0985 0 0.1641
##
## 1 ***
## 2 ***
## 3 ***
Notice the coefficient for height_r is 0.4370. That’s a fairly close estimate of the true effect (0.5).
But notice what happens when we use the raw values of height and age:
fit2 <- psem(
lm(height ~ age, data = d),
lm(depth ~ solar_angle + height, data = d)
)
sout2 <- summary(fit2, .progressBar = FALSE)
sout2$coefficients
## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 height age 0.4880 0.0198 98 24.6940 0 0.9282
## 2 depth solar_angle -0.6687 0.0383 97 -17.4649 0 -1.2650
## 3 depth height 0.4370 0.0857 97 5.0985 0 0.3693
##
## 1 ***
## 2 ***
## 3 ***
The height coefficient is identical (0.4370) to the model that used residuals. In addition, the coefficient for solar angle is closer to the true value of -0.7 (though I don’t know if that will always be the case).
Therefore it does seem redundant to use residualization in this case.
In simple linear models, the coefficients for both height and age are equivalent.
mod1 <- lm(depth ~ solar_angle + height + age, data = d)
coef(mod1)[c("height", "age")]
## height age
## 0.44804735 -0.01981515
mod2 <- lm(depth ~ solar_angle + height_r + age_r, data = d)
coef(mod2)[c("height_r", "age_r")]
## height_r age_r
## 0.44804735 -0.01981515
The FWL theorem mentions the outcome should be residualized as well. Let’s try that.
d$depth_r <- lm(depth ~ solar_angle, data = d)$residuals
fit3 <- psem(
lm(height_r ~ age_r, data = d),
lm(depth_r ~ solar_angle + height_r, data = d)
)
sout3 <- summary(fit3, .progressBar = FALSE)
sout3$coefficients
## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 height_r age_r 0.5396 0.0829 98 6.5060 0 0.5492
## 2 depth_r solar_angle 0.0000 0.0170 97 0.0000 1 0.0000
## 3 depth_r height_r 0.4370 0.0857 97 5.0985 0 0.4597
##
## 1 ***
## 2
## 3 ***
The height coefficient remains the same (0.4370), however the solar angle coefficient is now estimated to be 0. That’s not good.
In this toy example, it seems we’re better off working with the raw data despite the fact they’re highly correlated.
This is not intended to dismiss residualization (i.e., The partialling-out procedure) as a method. I just think for this particular SEM model it may not be needed.