We start by loading the following packages: HistData, car, and stargazer.
#load up necessary packages
library(HistData)
library(car)
library(stargazer)
We want to find a straight line that describes the relationship between x, some independent (predictor) variable, and y, an outcome or dependent variable.
You will recall from high school algebra that the equation for a straight line is:
\[\begin{equation} y = \alpha + \beta x \end{equation}\]
where \(\alpha\) is the y-intercept (the value of y when x=0) and \(\beta\) is the slope of the line (the amount that y will increase by with every unit increase in x).
A classic example is the height data from Sir Francis Galton, a 19th century statistician. He collected height data on 928 children born to 205 parents. The dataset is part of the R package HistData, which contains Data Sets from the History of Statistics and Data Visualization. The variables are the height of the child (as an adult) and the mid-parent height, that is, the average of the parents' heights.
If you go to the Environment tab, you will see the Global Environment. Click on “package:HistData” and you will see the Galton dataset.
Let’s see what is in there:
# see what is in the Galton dataset
summary(Galton)
## parent child
## Min. :64.00 Min. :61.70
## 1st Qu.:67.50 1st Qu.:66.20
## Median :68.50 Median :68.20
## Mean :68.31 Mean :68.09
## 3rd Qu.:69.50 3rd Qu.:70.20
## Max. :73.00 Max. :73.70
Let's develop the model on ~90% of the dataset, then test it on the remaining ~10% of the data. In the data science world, we call that first step "training" the model.
# divide the dataset into a training
# and a testing set based on a
# random uniform number on fixed seed
# set a "seed" for the random number generator
set.seed(20170208)
# create a series of random numbers from
# 0 to 1 using a uniform distribution function
Galton$group <- runif(length(Galton$parent),
min = 0, max = 1)
summary(Galton)
## parent child group
## Min. :64.00 Min. :61.70 Min. :0.002463
## 1st Qu.:67.50 1st Qu.:66.20 1st Qu.:0.245599
## Median :68.50 Median :68.20 Median :0.518798
## Mean :68.31 Mean :68.09 Mean :0.501614
## 3rd Qu.:69.50 3rd Qu.:70.20 3rd Qu.:0.743939
## Max. :73.00 Max. :73.70 Max. :0.999420
# group should look approximately uniform
hist(Galton$group)
# using the new random number, split
# the file: use 90% to train and 10% to test
Galton.train <- subset(Galton, group <= 0.90)
Galton.test <- subset(Galton, group > 0.90)
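A quick check on how many rows ended up in each piece of the split:
# count the rows in the training and testing pieces
nrow(Galton.train)
nrow(Galton.test)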
Let's look at plots of the children's heights (Y-axis) versus the average height of the parents (X-axis): first for the training dataset (90%), then for the test dataset (the remaining 10%).
# graph child on parent for training dataset
plot(child ~ parent, data = Galton.train,
main = "Galton Training Dataset: Children vs Parents Heights")
# graph child on parent for testing dataset
plot(child ~ parent, data = Galton.test,
main = "Galton Test Dataset: Children vs Parents Heights")
Let’s do the regression now on the training set. Linear regression is performed with the R function “lm” and takes the form
object.name <- lm(y ~ x, data = data_set_name)
Let's do that now with the Galton data, using the training dataset. Save the regression output as reg1, then look at reg1.
# linear regression of child height on
# mid-parent height in the training dataset
reg1 <- lm(child ~ parent, data = Galton.train)
reg1
##
## Call:
## lm(formula = child ~ parent, data = Galton.train)
##
## Coefficients:
## (Intercept) parent
## 23.8537 0.6474
The output is really simple. Let's look at a summary() of the regression output, summary(reg1).
# model summary
summary(reg1)
##
## Call:
## lm(formula = child ~ parent, data = Galton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7923 -1.3502 0.0604 1.6498 5.9445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.85366 2.99886 7.954 5.9e-15 ***
## parent 0.64736 0.04389 14.751 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.2069
## F-statistic: 217.6 on 1 and 829 DF, p-value: < 2.2e-16
# save model summary details
sreg1 <- summary(reg1)
Much more detailed output.
Interpretation: for each 1-inch increase in mid-parent height, the predicted child height increases by about 0.65 inches.
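For example, plugging a hypothetical mid-parent height of 68 inches into the fitted line gives 23.8537 + 0.6474 × 68, or about 67.9 inches; the predict() function does the same arithmetic for us:
# predicted child height for a hypothetical mid-parent height of 68 inches
predict(reg1, newdata = data.frame(parent = 68))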
Now the way that this is working is to estimate values for \(\alpha\) and \(\beta\) such that when you plug in your given independent variables, you get predicted dependent variables that are close to the observed values. In statistics we optimize this closeness by minimizing the sum-of-squared-residuals, that is
\[\begin{equation} \sum\limits_{i=1}^n (Y_{obs.i} - Y_{pred.i})^2 \end{equation}\]
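As a quick check, that quantity can be computed directly from the fitted model (a small sketch using base R helpers):
# residual sum of squares for reg1, two equivalent ways
sum((Galton.train$child - fitted(reg1))^2)
sum(resid(reg1)^2)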
So, let's look at how we did, using the predict() function.
# get predicted values in the training and testing dataset
Galton.train$pred.child <-
predict(reg1, newdata = Galton.train)
Galton.test$pred.child <-
predict(reg1, newdata=Galton.test)
# calculate residuals in the training and testing dataset
Galton.train$resid <-
Galton.train$child - Galton.train$pred.child
Galton.test$resid <-
Galton.test$child - Galton.test$pred.child
Now that we have calculated these values, let's look at some simple plots. The Companion to Applied Regression (aka car) package has some good functionality for this:
library(car)
#get the residual plots
residualPlots(reg1)
## Test stat Pr(>|Test stat|)
## parent 1.4592 0.1449
## Tukey test 1.4592 0.1445
Let’s look at how well we do in the testing dataset. First let’s plot the TEST data.
#plot test dataset
plot(child ~ parent, data = Galton.test)
#overlay the regression line
abline(a=23.85, b=0.65)
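As an aside, instead of typing the rounded coefficients by hand you can hand the fitted model to abline(), which pulls the intercept and slope out of the model object:
# equivalent overlay using the model object directly
plot(child ~ parent, data = Galton.test)
abline(reg1)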
library(ggplot2)
ggplot(Galton.test, aes(x=parent, y=child)) +
geom_point() +
geom_smooth(method = "lm")
Now let’s look at the residual plots:
Do the residuals have any pattern with the value of the predictor?
#plot residual versus X
plot(resid ~ parent, data = Galton.test)
#overlay the zero line
abline(0,0)
Do the residuals show any pattern with the predicted values?
#plot residual v predicted
plot(resid ~ pred.child, data = Galton.test)
Part of the statistical analysis of a linear regression is hypothesis testing. We usually want to test the null hypothesis that the slope is 0, i.e., that there is no relationship between y and x. We express this mathematically as:
\[\begin{equation} H_0: \beta = 0 \end{equation}\]
We test the null hypothesis against an alternative hypothesis, namely that the slope is not equal to 0, i.e., that there is some (positive or negative) relationship between y and x. We express this mathematically as:
\[\begin{equation} H_A: \beta \neq 0 \end{equation}\]
As you have seen in these data, the y-values do not line up as a perfect linear function of x, i.e., child height is not a perfect linear function of parent height. There is an error that occurs for each observation. So we are really fitting the statistical model
\[\begin{equation} Y = \alpha + \beta x + \epsilon \end{equation}\]
Recall from above that we estimate \(\alpha\) and \(\beta\) to minimize the sum of squared residuals.
To do a statistical test on the slopes we have to make some assumptions. One of the assumptions is that \(\epsilon\) ~ N(0, \(\sigma^2\)), that is, that the errors between the observed and predicted values of Y take on a normal distribution with mean of 0 and variance of \(\sigma^2\).
We can formally test this assumption, but we can also do a qq-plot which will give us a visual representation of the same. We get this plot as follows:
Do the residuals look normally distributed (from the model fit to the training data)?
# base R approach
hist(Galton.train$resid)
qqnorm(Galton.train$resid)
# using car package qqPlot()
# put in the model object
qqPlot(reg1)
## 1 13
## 1 11
In this last plot (from qqPlot(reg1) using the car package) we have taken each residual and divided it by its estimated standard deviation to create studentized residuals. We then rank them and calculate the percentile for each studentized residual, then create this graph. Of course, the car package is doing all of this under the hood, so to speak.
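If you want to see roughly what that looks like without the car machinery, here is a minimal base-R sketch using studentized residuals (no confidence envelope, so only an approximation of qqPlot()):
# a rough do-it-yourself version of the same idea
sr <- rstudent(reg1)   # studentized residuals
qqnorm(sr)
qqline(sr)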
To do the significance test, recall the summary of the linear regression model we had above.
##
## Call:
## lm(formula = child ~ parent, data = Galton.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7923 -1.3502 0.0604 1.6498 5.9445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.85366 2.99886 7.954 5.9e-15 ***
## parent 0.64736 0.04389 14.751 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.2069
## F-statistic: 217.6 on 1 and 829 DF, p-value: < 2.2e-16
Note that there are two estimated coefficients, so our estimated line is
\[\begin{equation} child = 23.85 + 0.65*parent \end{equation}\]
You will also notice that next to the estimate there is a column for the standard error, a column for the t-value, and a column for the p-value. We divide the estimate by the standard error to compute the t-value, which here has 829 degrees of freedom. For the slope estimate, the p-value is < 2e-16, i.e., the slope is highly significantly different from 0.
At the bottom of the summary you will notice a line for the F-statistic, which is the ratio of the mean square for the model to the mean square for the residuals. This has an F-distribution with 1 and 829 degrees of freedom and takes on the value 217.6, which has a p-value < 2.2e-16. Since this is a univariate regression, if you take the value of the t-statistic for the slope and square it, you will get the value of the F-statistic.
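You can verify both of these from the summary object we saved as sreg1 (a quick sketch):
# t = estimate / std. error, and (for univariate regression) F = t^2
tval <- coef(sreg1)["parent", "Estimate"] / coef(sreg1)["parent", "Std. Error"]
tval                           # about 14.75
tval^2                         # about 217.6, the F-statistic
2 * pt(-abs(tval), df = 829)   # two-sided p-value, effectively < 2e-16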
Isn’t math fun?
Suppose you have more than one independent variable that you think explains the dependent variable. That is, instead of the simple univariate case of
\[\begin{equation} y = \alpha + \beta x \end{equation}\]
You have the multivariate case of
\[\begin{equation} y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \end{equation}\]
As an example, consider the Prestige dataset (which comes with the car package). It consists of 102 observations and 6 variables, as follows:
education: The average number of years of education for occupational incumbents.
income: The average income of occupational incumbents, in dollars.
women: The percentage of women in the occupation.
prestige: The average prestige rating for the occupation.
census: The code of the occupation used in the survey.
type: Professional and managerial (prof), white collar (wc), blue collar (bc), or missing (NA).
We want to determine how the prestige of an occupation is related to average income, education (average number of years for incumbents), and the percentage of women in the occupation.
Let’s explore first:
#explore the 5 m's for the Prestige dataset
summary(Prestige)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
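Beyond summary(), a quick look at the first few rows and the structure helps confirm the variable types before plotting:
# peek at the raw data and its structure
head(Prestige)
str(Prestige)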
Now let’s see some plots:
# produce a scatterplot matrix
# using the function from the car package
# note: span adjusts the amount of smoothing
car::scatterplotMatrix(~ prestige + income +
education + women,
span = 0.7,
data = Prestige)
Hmmm…the variable income looks kind of wonky, for lack of a better term. So let’s see if a transformation will help:
Let's use the log2() function to transform income.
NOTE: there are multiple options for the log() function:
log() is the "natural" log - base "e" (exp(1) = 2.718282…)
log10() is log - base 10 (think decibels or the Richter Scale)
log2() is log - base 2
car::scatterplotMatrix( ~ prestige + log2(income) +
education + women,
span = 0.7,
data = Prestige)
Well, that looks a little better.
Now let’s see what the regression of prestige on income (logged), education, and women looks like:
# let the regression rip
prestige.mod1 <- lm(prestige ~ education +
log2(income) + women,
data = Prestige)
#and see what we have
summary(prestige.mod1)
##
## Call:
## lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.364 -4.429 -0.101 4.316 19.179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -110.9658 14.8429 -7.476 3.27e-11 ***
## education 3.7305 0.3544 10.527 < 2e-16 ***
## log2(income) 9.3147 1.3265 7.022 2.90e-10 ***
## women 0.0469 0.0299 1.568 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.093 on 98 degrees of freedom
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
Some helpful packages for formatting regression models using Rmarkdown:
sjPlot
gtsummary
apaTables
sjPlot package
Good News: This works very well knitting to HTML - once created you can cut and paste from HTML into MS Word (DOC) format.
Bad News: This does NOT knit directly to DOC or PDF.
library(sjPlot)
sjPlot::tab_model(prestige.mod1)
Dependent variable: prestige

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | -110.97 | -140.42 – -81.51 | <0.001 |
| education | 3.73 | 3.03 – 4.43 | <0.001 |
| income [log2] | 9.31 | 6.68 – 11.95 | <0.001 |
| women | 0.05 | -0.01 – 0.11 | 0.120 |
| Observations | 102 | | |
| R2 / R2 adjusted | 0.835 / 0.830 | | |
apaTables package
Bad News: This creates a file OUTSIDE of the Rmarkdown file - the file is saved in your current R project directory (given the code below).
Good News: The cool thing is that the output file is ready to go in MS Word (*.DOC) format!
library(apaTables)
apa.reg.table(prestige.mod1,
filename = "Table2_APA.doc",
table.number = 2)
##
##
## Table 2
##
## Regression results using prestige as the criterion
##
##
## Predictor b b_95%_CI beta beta_95%_CI sr2 sr2_95%_CI
## (Intercept) -110.97** [-140.42, -81.51]
## education 3.73** [3.03, 4.43] 0.59 [0.48, 0.70] .19 [.10, .27]
## log2(income) 9.31** [6.68, 11.95] 0.46 [0.33, 0.59] .08 [.03, .13]
## women 0.05 [-0.01, 0.11] 0.09 [-0.02, 0.20] .00 [-.01, .01]
##
##
##
## r Fit
##
## .85**
## .74**
## -.12
## R2 = .835**
## 95% CI[.77,.87]
##
##
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights.
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
##
Sneak peek: image of the Word DOC table created (Table 2 APA DOC file).
gtsummary package
This is probably the MOST flexible since you can knit directly to HTML, DOC or PDF. See more details at https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html.
library(gtsummary)
tbl_mod1 <- tbl_regression(prestige.mod1)
tbl_mod1
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| education | 3.7 | 3.0, 4.4 | <0.001 |
| log2(income) | 9.3 | 6.7, 12 | <0.001 |
| women | 0.05 | -0.01, 0.11 | 0.12 |

1 CI = Confidence Interval
From this output we see that if education increases by 1 year, then average prestige rating will increase by 3.73 units, with income and women held constant.
For education we note that the p-value is < 2e-16.
So we reject the null hypothesis \(H_0: \beta_1 = 0\).
How do we interpret the contributions of log2(income) and women?
Another point to note: the multiple R-squared is 0.84, indicating that 84% of the variability in average prestige rating is explained by these three independent variables.
Finally, the F-statistic of 165.4 is for testing the null hypothesis H_0: \(\beta_1 = \beta_2 = \beta_3 = 0\). It is highly significant. (and that “squaring the t to get the F” trick only works for the univariate regression case…)
Since the p-value for women does not reach the traditional significance level of 0.05, we might consider removing it from our model. Let’s see what happens then…
#run the model with only two predictors
prestige.mod2 <- lm(prestige ~ education +
log2(income), data= Prestige)
# look at the results
summary(prestige.mod2)
##
## Call:
## lm(formula = prestige ~ education + log2(income), data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0346 -4.5657 -0.1857 4.0577 18.1270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95.1940 10.9979 -8.656 9.27e-14 ***
## education 4.0020 0.3115 12.846 < 2e-16 ***
## log2(income) 7.9278 0.9961 7.959 2.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8275
## F-statistic: 243.3 on 2 and 99 DF, p-value: < 2.2e-16
gtsummary package
tbl_mod2 <- tbl_regression(prestige.mod2)
tbl_mod2
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| education | 4.0 | 3.4, 4.6 | <0.001 |
| log2(income) | 7.9 | 6.0, 9.9 | <0.001 |

1 CI = Confidence Interval
Hmmmm…. education and log2(income) remain highly significant, there is little reduction in R-squared, and the model is still significant.
stargazer package
We can compare the results of these two analyses a little more cleanly using the stargazer package as follows:
NOTE: stargazer works well if you are "knitting" to either HTML or PDF. It will NOT work if you are "knitting" to WORD (DOCX) formats.
The code below produces HTML output; change the type argument (e.g., to "latex") if you want PDF output. When knitting, the chunk that calls stargazer typically needs the chunk option results='asis' so the table renders properly.
# compare the results of the two regression models
stargazer(prestige.mod1, prestige.mod2,
title="Comparison of 2 Regression outputs",
type = "html", align=TRUE)
| | Dependent variable: prestige | |
|---|---|---|
| | (1) | (2) |
| education | 3.731*** (0.354) | 4.002*** (0.312) |
| log2(income) | 9.315*** (1.327) | 7.928*** (0.996) |
| women | 0.047 (0.030) | |
| Constant | -110.966*** (14.843) | -95.194*** (10.998) |
| Observations | 102 | 102 |
| R2 | 0.835 | 0.831 |
| Adjusted R2 | 0.830 | 0.828 |
| Residual Std. Error | 7.093 (df = 98) | 7.145 (df = 99) |
| F Statistic | 165.428*** (df = 3; 98) | 243.323*** (df = 2; 99) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
gtsummary merged tables
This is a little friendlier to different knitted outputs but does not include the model comparison statistics.
tbl_mod12 <-
tbl_merge(
tbls = list(tbl_mod1, tbl_mod2),
tab_spanner = c("**Model 1**", "**Model 2**")
)
tbl_mod12
| Characteristic | Model 1 Beta | Model 1 95% CI1 | Model 1 p-value | Model 2 Beta | Model 2 95% CI1 | Model 2 p-value |
|---|---|---|---|---|---|---|
| education | 3.7 | 3.0, 4.4 | <0.001 | 4.0 | 3.4, 4.6 | <0.001 |
| log2(income) | 9.3 | 6.7, 12 | <0.001 | 7.9 | 6.0, 9.9 | <0.001 |
| women | 0.05 | -0.01, 0.11 | 0.12 | | | |

1 CI = Confidence Interval
We conclude that we lose little by eliminating a predictor variable. This is also consistent with a principle of parsimony, also known as the KISS principle, or “Keep It Simple, Silly”.
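Since the two models are nested, we can also make the comparison formal with a partial F-test via anova(); its p-value should echo the t-test on women above:
# partial F-test: does adding women improve on the two-predictor model?
anova(prestige.mod2, prestige.mod1)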
Let's finish with some diagnostics. First, the plots for the first model with 3 independent variables:
# diagnostics for the first model
# with 3 independent variables
# function from car package
residualPlots(prestige.mod1)
## Test stat Pr(>|Test stat|)
## education 1.7555 0.08233 .
## log2(income) 0.5851 0.55981
## women 2.2770 0.02499 *
## Tukey test 0.7630 0.44545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that there is a non-zero trend for the variable women.
And now the plots for the second model with 2 independent variables:
# diagnostics for the second model
# with 2 independent variables
residualPlots(prestige.mod2)
## Test stat Pr(>|Test stat|)
## education 1.6153 0.1095
## log2(income) 0.2213 0.8253
## Tukey test 0.6530 0.5137
Now the plots look really good, much better.
Another diagnostic tool is the added-variable plot, which shows the additional contribution of each variable given that all of the others are already in the model. In this particular plot we can also identify the most influential observations.
#added variable plots - car function
avPlots(prestige.mod1, id.n=2, id.cex=0.7)
#id.n - identify the n most influential observations
#id.cex - controls the size of the labels for the identified points
Let’s run the qq-plot for model 1:
# run the qq-plot
qqPlot(prestige.mod1, id.n=3)
## collectors farmers
## 46 67
# here, id.n identifies the n observations with the largest residuals in absolute value
Are there any outliers?
We can also run a Bonferroni test for outliers.
#run Bonferroni test for outliers
outlierTest(prestige.mod1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## farmers 2.857462 0.0052259 0.53305
Are there any points that are of high influence?
#identify highly influential points
influenceIndexPlot(prestige.mod1, id.n=3)
NB. If there are points that are a) outliers AND b) highly influential, these have potential to change the inference. You should consider removing them.
How do we make heads or tails out of the plots above? One way is with an influence plot.
#make influence plot
influencePlot(prestige.mod1, id.n=3)
## StudRes Hat CookD
## ministers 2.4211475 0.10072159 0.15638036
## collectors -2.5320314 0.01352116 0.02081914
## newsboys -0.3534964 0.28595843 0.01262364
## babysitters 1.7227767 0.19703063 0.17848347
## farmers 2.8574618 0.03896096 0.07711591
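If a point is flagged as both an outlier and highly influential, one cautious follow-up is to refit the model without it and see whether the coefficients move much. Here farmers (flagged above) is used purely as an illustration, not a recommendation to drop it:
# refit without one flagged row and compare the coefficient estimates
prestige.mod1b <- update(prestige.mod1,
                         data = Prestige[rownames(Prestige) != "farmers", ])
compareCoefs(prestige.mod1, prestige.mod1b)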
Another diagnostic is to test for heteroskedasticity (i.e., the variance of the error term is not constant).
#test for heteroskedasticity
ncvTest(prestige.mod1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.02841389, Df = 1, p = 0.86614
We also want to look for multicollinearity, that is, whether some of our independent variables are highly correlated with each other. We do this by looking at the Variance Inflation Factor (VIF). A VIF > 4 suggests collinearity.
vif(prestige.mod1)
## education log2(income) women
## 1.877097 2.572283 1.806431
olsrr package
Another helpful R package for linear regression models is the olsrr package.
We can use the ols_vif_tol() function to also compute the VIFs and tolerances (TOL) for each predictor in the model. TOL is the inverse of the VIF.
library(olsrr)
ols_vif_tol(prestige.mod1)
## Variables Tolerance VIF
## 1 education 0.5327376 1.877097
## 2 log2(income) 0.3887597 2.572283
## 3 women 0.5535778 1.806431
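As a quick check that the tolerance column really is just the reciprocal of the VIF:
# tolerance = 1 / VIF
1 / vif(prestige.mod1)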
Another good measure of multicollinearity is the condition index. You want the condition index < 30. When you look at the output below, only look at the LAST LINE of the output.
ols_eigen_cindex(prestige.mod1)
## Eigenvalue Condition Index intercept education log2(income)
## 1 3.4928045271 1.000000 0.0001776490 0.002498313 0.0001405901
## 2 0.4705124616 2.724593 0.0002315624 0.003495553 0.0002925430
## 3 0.0356933217 9.892218 0.0149794957 0.595229304 0.0048739025
## 4 0.0009896896 59.407002 0.9846112929 0.398776831 0.9946929645
## women
## 1 0.0150905237
## 2 0.5254459989
## 3 0.0009712806
## 4 0.4584921968
The condition index for Model 1 was 59, which is >> 30. When we removed women in Model 2, was the condition index lower? Also look at the VIFs for Model 2.
ols_vif_tol(prestige.mod2)
## Variables Tolerance VIF
## 1 education 0.6995808 1.429427
## 2 log2(income) 0.6995808 1.429427
ols_eigen_cindex(prestige.mod2)
## Eigenvalue Condition Index intercept education log2(income)
## 1 2.96242154 1.000000 0.0004671053 0.004676692 0.0003647341
## 2 0.03575405 9.102502 0.0274171118 0.777086884 0.0091862089
## 3 0.00182441 40.296029 0.9721157829 0.218236424 0.9904490570
It is lower, but the index is still > 30, even though the VIFs are all < 2. This is probably due to education and income being correlated with each other - they are not independent of each other…
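You can check that pairwise correlation directly:
# correlation between the two remaining predictors
with(Prestige, cor(education, log2(income)))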