The rvif (Redefined Variance Inflation Factor) package focuses on
determining whether or not the degree of approximate multicollinearity
in a multiple linear regression model is of concern, meaning that it
affects the statistical analysis (i.e. individual significance tests) of
the model. It does so by implementing a statistical test via the
multicollinearity() function. From the package vignette:
“the test determines whether the non-rejection of the null hypothesis in
the individual significance tests of the coefficient associated with
each independent variable is due to the degree of
multicollinearity.”
The intro to the R Journal article quotes the O’Brien paper: “However, the measures traditionally applied to detect multicollinearity may conclude that multicollinearity exists even if it does not lead to the negative effects mentioned above, when, in fact, the best solution in this case may be not to treat the multicollinearity (see O’Brien (2007).”
“To determine whether the tendency not to reject the null hypothesis in the individual significance test is caused by a troubling approximate multicollinearity that inflates the variance of the estimator, or whether it is caused by variables not being statistically significantly related, the following situations are distinguished with a significance level \(\alpha\):”
a If the null hypothesis is initially rejected in the original model, the following results can be obtained for the orthonormal model:
b If the null hypothesis is not initially rejected in the original model, the following results may occur for the orthonormal model:
Screenshot from the rvif vignette:
In words, if RVIF exceeds the maximum of \(c_0(i)\) and \(c_3(i)\), then multicollinearity is affecting the statistical analysis of the model. See the R Journal article for derivations of these expressions.
library(rvif)
library(faraway)
Using the prostate data from the faraway package, model lpsa (log
prostate specific antigen) as a function of all other variables and
investigate multicollinearity using vif() from the car
package.
data("prostate")
m <- lm(lpsa ~ ., data = prostate)
vif(m)
## lcavol lweight age lbph svi lcp gleason pgg45
## 2.054115 1.363704 1.323599 1.375534 1.956881 3.097954 2.473411 2.974361
Now investigate multicollinearity using the rvifs()
function from the rvif package. Notice we need to use
model.matrix() to extract the right-hand side variables
from the model.
Notice the intercept is included in the output. These RVIF values are much higher than VIF, but there is no threshold to consider.
rvifs(x = model.matrix(m))
## RVIF %
## Intercept 324.836827 99.6922
## Variable 2 4.777115 79.0669
## Variable 3 75.902487 98.6825
## Variable 4 99.736712 98.9974
## Variable 5 1.382185 27.6508
## Variable 6 2.497599 59.9615
## Variable 7 3.149462 68.2485
## Variable 8 220.997936 99.5475
## Variable 9 5.220262 80.8439
Use the Decision Rule to Detect Troubling Multicollinearity. The function returns the value of the RVIF and the established thresholds (c0 and c3), as well as indicating whether or not the individual significance analysis is affected by multicollinearity at the chosen significance level.
multicollinearity(y = prostate[,"lpsa"], x = model.matrix(m))
## RVIFs c0 c3 Scenario Affects
## 1 3.348833e+00 2.260428e-01 0.0111399208 b.1 Yes
## 2 1.540289e-02 1.738641e-01 0.0004424191 a.1 No
## 3 5.759506e-02 1.042094e-01 0.0191900584 a.1 No
## 4 2.487380e-04 1.945627e-04 0.0011738455 b.2 No
## 5 6.807399e-03 5.782391e-03 0.0126207847 b.2 No
## 6 1.189333e-01 2.961675e-01 0.0396061480 a.1 No
## 7 1.650575e-02 5.612992e-03 0.2530673200 b.2 No
## 8 4.940709e-02 1.028147e-03 0.1384051616 b.2 No
## 9 3.894933e-05 1.033197e-05 0.0001468308 b.2 No
The individual significance test of the intercept is affected by the degree of existing multicollinearity. The Scenario of b.1 says that the intercept was not rejected in the original model but is rejected in the orthonormal reference model.
coef(summary(m))["(Intercept)",]
## Estimate Std. Error t value Pr(>|t|)
## 0.6693367 1.2963875 0.5163091 0.6069335
Model savings rate (sr) as a function of pop15 (percent population under age of 15), pop75, (percent population over age of 75), dpi (per-capita disposable income in dollars), and ddpi (percent growth rate of dpi). View VIF.
data("savings")
m2 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
vif(m2)
## pop15 pop75 dpi ddpi
## 5.937661 6.629105 2.884369 1.074309
Now look at RVIFs.
rvifs(x = model.matrix(m2))
## RVIF %
## Intercept 187.025680 99.4653
## Variable 2 95.009426 98.9475
## Variable 3 27.976176 96.4255
## Variable 4 6.556330 84.7476
## Variable 5 2.953623 66.1433
Use the Decision Rule to Detect Troubling Multicollinearity.
multicollinearity(y = savings[,"sr"], x = model.matrix(m2))
## RVIFs c0 c3 Scenario Affects
## 1 3.740514e+00 1.391109e+01 4.692010e-02 a.1 No
## 2 1.446816e-03 3.625978e-03 4.157893e-04 a.1 No
## 3 8.120077e-02 4.877557e-02 8.929468e-02 b.2 No
## 4 5.995458e-08 1.934935e-09 2.836012e-07 b.2 No
## 5 2.662002e-03 2.861414e-03 2.476486e-03 a.1 No
According to the results, the level of multicollinearity detected does not affect the statistical analysis.
Simulate highly correlated data, generate an outcome (Y), and then fit the correct model.
library(MASS)
set.seed(1)
n <- 300
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
Sigma = matrix(c(1,r,r,
r,1,r,
r,r,1),
byrow = T,
nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
rnorm(n = n, sd = 1.1)
m3 <- lm(Y ~ V1 + V2 + V3, data = d)
vif(m3)
## V1 V2 V3
## 54.52195 52.42181 57.53355
Now look at RVIFs:
rvifs(x = model.matrix(m3))
## RVIF %
## Intercept 1.004846 0.4823
## Variable 2 54.607026 98.1687
## Variable 3 52.468519 98.0941
## Variable 4 57.603948 98.2640
Use the Decision Rule to Detect Troubling Multicollinearity.
multicollinearity(y = d[,"Y"], x = model.matrix(m3))
## RVIFs c0 c3 Scenario Affects
## 1 0.003349487 0.0039348024 0.004787978 a.2 Contradiction
## 2 0.197220475 0.0009671839 0.004267780 b.1 Yes
## 3 0.186752687 0.0185034292 1.320402808 b.2 No
## 4 0.207587988 0.3379381655 0.127516738 a.1 No
The degree of multicollinearity in the model affects the individual significance of variable 1 (V1). The “Contradiction” result for the intercept is due to scenario “a.2” where the intercept test statistic was rejected in the original model but not in the reference orthonormal model.
View the confidence intervals on the coefficients. Notice the interval of V1 [-1.05508392, 0.9169819] contains 0 and is not significant. The “true” value of the coefficient is 0.5. The test on V1 is not significant in the model, but is significant in the reference orthonormal model.
confint(m3)
## 2.5 % 97.5 %
## (Intercept) 0.01077577 0.2677766
## V1 -1.05508392 0.9169819
## V2 -1.26153260 0.6574845
## V3 0.27910799 2.3023439
Set a new seed, double the sample size and check results.
set.seed(11)
n <- 600
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
Sigma = matrix(c(1,r,r,
r,1,r,
r,r,1),
byrow = T,
nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
rnorm(n = n, sd = 1.1)
m3 <- lm(Y ~ V1 + V2 + V3, data = d)
multicollinearity(y = d[,"Y"], x = model.matrix(m3))
## RVIFs c0 c3 Scenario Affects
## 1 0.001667416 0.007239347 0.0004419016 a.1 No
## 2 0.113089227 0.072965163 0.0009234877 b.1 Yes
## 3 0.107643540 0.012047303 9.5106673247 b.2 No
## 4 0.113329710 0.086660972 0.1482053903 b.2 No
Collinearity again is affecting V1. Check the confidence intervals. The interval for V1 is wide and includes 0. The “true” value of the coefficient is 0.5. The test on V1 is not significant in the model, but is significant in the reference orthonormal model.
confint(m3)
## 2.5 % 97.5 %
## (Intercept) 0.09410333 0.2677796
## V1 -0.14071109 1.2895950
## V2 -0.93113933 0.4643045
## V3 -0.08987580 1.3419502
multicollinearity() handles interactions
and non-linear terms. Neither the R journal article or rvif vignettes
addresses these topics.multicollinearity().library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
multicollinearity(y = sleepstudy[,"Reaction"], x = model.matrix(fm1))
multicollinearity() provides 10
additional examples.cv_vif() function provides the values for the
Variance Inflation Factor (VIF) and the Coefficient of Variation (CV)
for the independent variables (excluding the intercept) in a multiple
linear regression model. From the help page: “It is interesting to note
the distinction between essential and non-essential multicollinearity.
Essential multicollinearity happens when there is an approximate linear
relationship between two or more independent variables (not including
the intercept) while non-essential multicollinearity involves a linear
relationship between the intercept and at least one independent
variable. This distinction matters because the Variance Inflation Factor
(VIF) only detects essential multicollinearity, while the Condition
Value (CV) is useful for detecting only non-essential multicollinearity.
Understanding the distinction between essential and non-essential
multicollinearity and the limitations of each detection measure, can be
very useful for identifying whether there is a troubling degree of
multicollinearity, and determining the kind of multicollinearity present
and the variables causing it.”plot() method for class “cv_vif” produces a
scatterplot of Coefficient of Variation (CV) versus the Variance
Inflation Factor (VIF) for the independent variables (excluding the
intercept) of a multiple linear regression model. “The distinction
between essential and non-essential multicollinearity and the
limitations of each measure (CV and VIF) for detecting the different
kinds of multicollinearity, can be very useful for identifying whether
there is a troubling degree of multicollinearity, and determining the
kind of multicollinearity present and the variables causing it.”check_collinearity() function that works from lm and
mixed-effect models. See this
blog post.