Investigating Residualization

Intro

Student is studying variation in forest canopy depth across the eastern United States. She is examining how canopy depth changes with latitude and to identify “key drivers” at both the among-site and within-site scales. The study includes 11 forest sites spanning a broad range of latitudes and five potential predictors.

In the document sent before the meeting, the following statement is made (emphasis mine):

To further address collinearity and enable causal interpretation, we separated shared and independent effects prior to structural modelling. Forest age and canopy height were each regressed against solar angle, and then their residuals—representing variation independent of latitude—were used in a structural equation model (SEM) to quantify direct and indirect pathways on canopy depth.

This method was unfamiliar to me so I did some investigating.

The Frisch-Waugh-Lovell Theorem

This idea of using residuals as predictors apparently comes from the Frisch-Waugh-Lovell Theorem. This page provides an accessible overview along with a demo in R. The main attraction is that it can create orthogonal predictors with no correlation. But the new predictors (the residuals) are estimates themselves, so additional error may propagate through the model. This is also known as the Partialling-out Procedure. (See graphic at the top of this page.)

The SEM model

Here is the SEM model the student fit.

In words, “age affects height, which affects depth. Solar angle also affects depth.” In this model she is using residuals for the age and height predictors. In other words, she fit the following models and extracted the residuals. Something like this, I think:

age_resid <- lm(age ~ solar_angle)$residuals
height_resid <- lm(height ~ solar_angle)$residuals

In the SEM, age_resid and height_resid are being used as the predictors. Presumably this is desirable because the raw age and height values are highly correlated with solar angle. The derived residual values are orthogonal to solar angle. However, since solar angle is also in the model, the estimated coefficients should be the same. So the step of deriving the residuals and using them in the model should be redundant.

Demonstration

Simulate correlated data according to the SEM path model above. The “true” coefficients are as follows:

Age –> Height: 0.6
Height –> Depth: 0.5
Solar Angle –> Depth: -0.7

set.seed(4102026)
n <- 100
solar_angle <- rnorm(n)  
age <- 0.8*solar_angle + rnorm(n, sd = 0.2)  # induce correlation with solar_angle
height <- 0.5*age + rnorm(n, sd = 0.2)
depth <- -0.7*solar_angle + 0.5*height + rnorm(n, sd = 0.2)
d <- data.frame(depth, solar_angle, height, age)

Height and age are highly correlated with solar angle:

cor(d[,c("solar_angle", "height", "age")])

##             solar_angle    height       age
## solar_angle   1.0000000 0.8958601 0.9713247
## height        0.8958601 1.0000000 0.9281924
## age           0.9713247 0.9281924 1.0000000

Since the predictors are highly correlated, derive their residuals by regressing each on solar angle.

# residualization
d$height_r <- lm(height ~ solar_angle, data = d)$residuals
d$age_r <- lm(age ~ solar_angle, data = d)$residuals

Now their correlation with solar angle are basically 0 (though the residuals themselves are still somewhat correlated). This seems like a good thing!

round(cor(d[,c("solar_angle", "height_r", "age_r")]), 4)

##             solar_angle height_r  age_r
## solar_angle           1   0.0000 0.0000
## height_r              0   1.0000 0.5492
## age_r                 0   0.5492 1.0000

Now fit an SEM in attempt to recover true values used to generate data.

library(piecewiseSEM)
fit1 <- psem(
  lm(height_r ~ age_r, data = d),
  lm(depth ~ solar_angle + height_r, data = d)
  )
sout1 <- summary(fit1, .progressBar = FALSE)
sout1$coefficients

##   Response   Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 height_r       age_r   0.5396    0.0829 98     6.5060       0       0.5492
## 2    depth solar_angle  -0.4938    0.0170 97   -29.0262       0      -0.9341
## 3    depth    height_r   0.4370    0.0857 97     5.0985       0       0.1641
##      
## 1 ***
## 2 ***
## 3 ***

Notice the coefficient for height_r is 0.4370. That’s a fairly close estimate of the true effect (0.5).

But notice what happens when we use the raw values of height and age:

fit2 <- psem(
  lm(height ~ age, data = d),
  lm(depth ~ solar_angle + height, data = d)
  )
sout2 <- summary(fit2, .progressBar = FALSE)
sout2$coefficients

##   Response   Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1   height         age   0.4880    0.0198 98    24.6940       0       0.9282
## 2    depth solar_angle  -0.6687    0.0383 97   -17.4649       0      -1.2650
## 3    depth      height   0.4370    0.0857 97     5.0985       0       0.3693
##      
## 1 ***
## 2 ***
## 3 ***

The height coefficient is identical (0.4370) to the model that used residuals. In addition, the coefficient for solar angle is closer to the true value of -0.7 (though I don’t know if that will always be the case).

Therefore it does seem redundant to use residualization in this case.

Simple linear models

In simple linear models, the coefficients for both height and age are equivalent.

mod1 <- lm(depth ~ solar_angle + height + age, data = d)
coef(mod1)[c("height", "age")]

##      height         age 
##  0.44804735 -0.01981515

mod2 <- lm(depth ~ solar_angle + height_r + age_r, data = d)
coef(mod2)[c("height_r", "age_r")]

##    height_r       age_r 
##  0.44804735 -0.01981515

Residualize outcome as well

The FWL theorem mentions the outcome should be residualized as well. Let’s try that.

d$depth_r <- lm(depth ~ solar_angle, data = d)$residuals

fit3 <- psem(
  lm(height_r ~ age_r, data = d),
  lm(depth_r ~ solar_angle + height_r, data = d)
  )
sout3 <- summary(fit3, .progressBar = FALSE)
sout3$coefficients

##   Response   Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 height_r       age_r   0.5396    0.0829 98     6.5060       0       0.5492
## 2  depth_r solar_angle   0.0000    0.0170 97     0.0000       1       0.0000
## 3  depth_r    height_r   0.4370    0.0857 97     5.0985       0       0.4597
##      
## 1 ***
## 2    
## 3 ***

The height coefficient remains the same (0.4370), however the solar angle coefficient is now estimated to be 0. That’s not good.

In this toy example, it seems we’re better off working with the raw data despite the fact they’re highly correlated.

Disclaimer

This is not intended to dismiss residualization (i.e., The partialling-out procedure) as a method. I just think for this particular SEM model it may not be needed.

References

Discussion with Claude AI
The FWL Theorem: Making Multivariate Regressions Intuitive
Fundamental Theorems for Econometrics
Lefcheck JS (2016). “piecewiseSEM: Piecewise structural equation modeling in R for ecology, evolution, and systematics.” Methods in Ecology and Evolution,7(5), 573-579. doi:10.1111/2041-210X.12512 https://doi.org/10.1111/2041-210X.12512
An Introduction to Structural Equation Modeling in R