Intro

The rvif (Redefined Variance Inflation Factor) package focuses on determining whether or not the degree of approximate multicollinearity in a multiple linear regression model is of concern, meaning that it affects the statistical analysis (i.e. individual significance tests) of the model. It does so by implementing a statistical test via the multicollinearity() function. From the package vignette: “the test determines whether the non-rejection of the null hypothesis in the individual significance tests of the coefficient associated with each independent variable is due to the degree of multicollinearity.”

The intro to the R Journal article quotes the O’Brien paper: “However, the measures traditionally applied to detect multicollinearity may conclude that multicollinearity exists even if it does not lead to the negative effects mentioned above, when, in fact, the best solution in this case may be not to treat the multicollinearity (see O’Brien (2007).”

Preliminaries

The four scenarios

“To determine whether the tendency not to reject the null hypothesis in the individual significance test is caused by a troubling approximate multicollinearity that inflates the variance of the estimator, or whether it is caused by variables not being statistically significantly related, the following situations are distinguished with a significance level \(\alpha\):”

a If the null hypothesis is initially rejected in the original model, the following results can be obtained for the orthonormal model:

b If the null hypothesis is not initially rejected in the original model, the following results may occur for the orthonormal model:

The decision rule

Screenshot from the rvif vignette:

In words, if RVIF exceeds the maximum of \(c_0(i)\) and \(c_3(i)\), then multicollinearity is affecting the statistical analysis of the model. See the R Journal article for derivations of these expressions.

Examples

library(rvif)
library(faraway)

Prostate data

Using the prostate data from the faraway package, model lpsa (log prostate specific antigen) as a function of all other variables and investigate multicollinearity using vif() from the car package.

data("prostate")
m <- lm(lpsa ~ ., data = prostate)
vif(m)
##   lcavol  lweight      age     lbph      svi      lcp  gleason    pgg45 
## 2.054115 1.363704 1.323599 1.375534 1.956881 3.097954 2.473411 2.974361

Now investigate multicollinearity using the rvifs() function from the rvif package. Notice we need to use model.matrix() to extract the right-hand side variables from the model.

  • RVIF = Redefined Variance Inflation Factor of each independent variable.
  • % = Percentage of near multicollinearity due to each independent variable.

Notice the intercept is included in the output. These RVIF values are much higher than VIF, but there is no threshold to consider.

rvifs(x = model.matrix(m))
##                  RVIF       %
## Intercept  324.836827 99.6922
## Variable 2   4.777115 79.0669
## Variable 3  75.902487 98.6825
## Variable 4  99.736712 98.9974
## Variable 5   1.382185 27.6508
## Variable 6   2.497599 59.9615
## Variable 7   3.149462 68.2485
## Variable 8 220.997936 99.5475
## Variable 9   5.220262 80.8439

Use the Decision Rule to Detect Troubling Multicollinearity. The function returns the value of the RVIF and the established thresholds (c0 and c3), as well as indicating whether or not the individual significance analysis is affected by multicollinearity at the chosen significance level.

multicollinearity(y = prostate[,"lpsa"], x = model.matrix(m))
##          RVIFs           c0           c3 Scenario Affects
## 1 3.348833e+00 2.260428e-01 0.0111399208      b.1     Yes
## 2 1.540289e-02 1.738641e-01 0.0004424191      a.1      No
## 3 5.759506e-02 1.042094e-01 0.0191900584      a.1      No
## 4 2.487380e-04 1.945627e-04 0.0011738455      b.2      No
## 5 6.807399e-03 5.782391e-03 0.0126207847      b.2      No
## 6 1.189333e-01 2.961675e-01 0.0396061480      a.1      No
## 7 1.650575e-02 5.612992e-03 0.2530673200      b.2      No
## 8 4.940709e-02 1.028147e-03 0.1384051616      b.2      No
## 9 3.894933e-05 1.033197e-05 0.0001468308      b.2      No

The individual significance test of the intercept is affected by the degree of existing multicollinearity. The Scenario of b.1 says that the intercept was not rejected in the original model but is rejected in the orthonormal reference model.

coef(summary(m))["(Intercept)",]
##   Estimate Std. Error    t value   Pr(>|t|) 
##  0.6693367  1.2963875  0.5163091  0.6069335

Savings rates data

Model savings rate (sr) as a function of pop15 (percent population under age of 15), pop75, (percent population over age of 75), dpi (per-capita disposable income in dollars), and ddpi (percent growth rate of dpi). View VIF.

data("savings")
m2 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
vif(m2)
##    pop15    pop75      dpi     ddpi 
## 5.937661 6.629105 2.884369 1.074309

Now look at RVIFs.

rvifs(x = model.matrix(m2))
##                  RVIF       %
## Intercept  187.025680 99.4653
## Variable 2  95.009426 98.9475
## Variable 3  27.976176 96.4255
## Variable 4   6.556330 84.7476
## Variable 5   2.953623 66.1433

Use the Decision Rule to Detect Troubling Multicollinearity.

multicollinearity(y = savings[,"sr"], x = model.matrix(m2))
##          RVIFs           c0           c3 Scenario Affects
## 1 3.740514e+00 1.391109e+01 4.692010e-02      a.1      No
## 2 1.446816e-03 3.625978e-03 4.157893e-04      a.1      No
## 3 8.120077e-02 4.877557e-02 8.929468e-02      b.2      No
## 4 5.995458e-08 1.934935e-09 2.836012e-07      b.2      No
## 5 2.662002e-03 2.861414e-03 2.476486e-03      a.1      No

According to the results, the level of multicollinearity detected does not affect the statistical analysis.

Simulated data

Simulate highly correlated data, generate an outcome (Y), and then fit the correct model.

library(MASS)
set.seed(1)
n <- 300
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0), 
                           Sigma = matrix(c(1,r,r, 
                                            r,1,r,
                                            r,r,1), 
                                          byrow = T, 
                                          nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 1.1)
m3 <- lm(Y ~ V1 + V2 + V3, data = d)
vif(m3)
##       V1       V2       V3 
## 54.52195 52.42181 57.53355

Now look at RVIFs:

rvifs(x = model.matrix(m3))
##                 RVIF       %
## Intercept   1.004846  0.4823
## Variable 2 54.607026 98.1687
## Variable 3 52.468519 98.0941
## Variable 4 57.603948 98.2640

Use the Decision Rule to Detect Troubling Multicollinearity.

multicollinearity(y = d[,"Y"], x = model.matrix(m3))
##         RVIFs           c0          c3 Scenario       Affects
## 1 0.003349487 0.0039348024 0.004787978      a.2 Contradiction
## 2 0.197220475 0.0009671839 0.004267780      b.1           Yes
## 3 0.186752687 0.0185034292 1.320402808      b.2            No
## 4 0.207587988 0.3379381655 0.127516738      a.1            No

The degree of multicollinearity in the model affects the individual significance of variable 1 (V1). The “Contradiction” result for the intercept is due to scenario “a.2” where the intercept test statistic was rejected in the original model but not in the reference orthonormal model.

View the confidence intervals on the coefficients. Notice the interval of V1 [-1.05508392, 0.9169819] contains 0 and is not significant. The “true” value of the coefficient is 0.5. The test on V1 is not significant in the model, but is significant in the reference orthonormal model.

confint(m3)
##                   2.5 %    97.5 %
## (Intercept)  0.01077577 0.2677766
## V1          -1.05508392 0.9169819
## V2          -1.26153260 0.6574845
## V3           0.27910799 2.3023439

Set a new seed, double the sample size and check results.

set.seed(11)
n <- 600
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0), 
                           Sigma = matrix(c(1,r,r, 
                                            r,1,r,
                                            r,r,1), 
                                          byrow = T, 
                                          nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 1.1)
m3 <- lm(Y ~ V1 + V2 + V3, data = d)
multicollinearity(y = d[,"Y"], x = model.matrix(m3))
##         RVIFs          c0           c3 Scenario Affects
## 1 0.001667416 0.007239347 0.0004419016      a.1      No
## 2 0.113089227 0.072965163 0.0009234877      b.1     Yes
## 3 0.107643540 0.012047303 9.5106673247      b.2      No
## 4 0.113329710 0.086660972 0.1482053903      b.2      No

Collinearity again is affecting V1. Check the confidence intervals. The interval for V1 is wide and includes 0. The “true” value of the coefficient is 0.5. The test on V1 is not significant in the model, but is significant in the reference orthonormal model.

confint(m3)
##                   2.5 %    97.5 %
## (Intercept)  0.09410333 0.2677796
## V1          -0.14071109 1.2895950
## V2          -0.93113933 0.4643045
## V3          -0.08987580 1.3419502

Limitations/Questions

library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
multicollinearity(y = sleepstudy[,"Reaction"], x = model.matrix(fm1))

See also

References