Linear Regression

We start by loading the following packages: HistData, car, and stargazer.

#load up necessary packages
library(HistData)
library(car)
library(stargazer)

Understanding Linear Regression: The Univariate Case

We want to find a straight line that describes the relationship between x, some independent (predictor) variable, and y, an outcome or dependent variable.

You will recall from high school algebra that the equation for a straight line is:

\[\begin{equation} y = \alpha + \beta x \end{equation}\]

where \(\alpha\) is the y-intercept (the value of y when x=0) and \(\beta\) is the slope of the line (the amount that y will increase by with every unit increase in x).

A classic example is the height data from Sir Francis Galton, a 19th century statistician. He collected height data on 928 children born to 205 parents. The dataset is part of the R package HistData, which contains Data Sets from the History of Statistics and Data Visualization. The variables are the height of the child (as an adult) and the mid-parent height, that is, the average of the parents’ heights.

If you go to the Environment tab, you will see the Global Environment. Click on “package:HistData” and you will see the Galton dataset.

Let’s see what is in there:

# see what is in the Galton dataset
summary(Galton)
##      parent          child      
##  Min.   :64.00   Min.   :61.70  
##  1st Qu.:67.50   1st Qu.:66.20  
##  Median :68.50   Median :68.20  
##  Mean   :68.31   Mean   :68.09  
##  3rd Qu.:69.50   3rd Qu.:70.20  
##  Max.   :73.00   Max.   :73.70

Simple linear regression model

Split data into training and testing

Let’s develop the model on ~90% of the dataset, then test it on the remaining ~10% of the data. In the data science world, we call that first step “training” the model.

# divide the dataset into a training 
# and a testing set based on a 
# random uniform number on fixed seed

# set a "seed" for the random number generator
set.seed(20170208)

# create a series of random numbers from
# 0 to 1 using a uniform distribution function
Galton$group <- runif(length(Galton$parent), 
                      min = 0, max = 1)
summary(Galton)
##      parent          child           group         
##  Min.   :64.00   Min.   :61.70   Min.   :0.002463  
##  1st Qu.:67.50   1st Qu.:66.20   1st Qu.:0.245599  
##  Median :68.50   Median :68.20   Median :0.518798  
##  Mean   :68.31   Mean   :68.09   Mean   :0.501614  
##  3rd Qu.:69.50   3rd Qu.:70.20   3rd Qu.:0.743939  
##  Max.   :73.00   Max.   :73.70   Max.   :0.999420
# group should look approximately uniform
hist(Galton$group)

# using the new random number, split
# the file: use 90% to train and 10% to test
Galton.train <- subset(Galton, group <= 0.90)
Galton.test <- subset(Galton, group > 0.90)

Look at some plots - visualize your data

Let’s look at a plot of the children’s heights (Y-axis) versus the average height of the parents (X-axis).

Now let’s graph the training dataset (90%) and the test set data (the smaller 10%).

# graph child on parent for training dataset
plot(child ~ parent, data = Galton.train,
     main = "Galton Training Dataset: Children vs Parents Heights")

# graph child on parent for testing dataset
plot(child ~ parent, data = Galton.test,
     main = "Galton Test Dataset: Children vs Parents Heights")

Perform (fit) linear regression model

Let’s do the regression now on the training set. Linear regression is performed with the R function “lm” and takes the form

object.name <- lm(y ~ x, data = data_set_name)

Let’s do that now with the Galton data - using the training dataset.

Save the regression output as reg1.

Look at reg1.

# linear regression of child height on 
# mid-parent height in the training dataset
reg1 <- lm(child ~ parent, data = Galton.train)
reg1
## 
## Call:
## lm(formula = child ~ parent, data = Galton.train)
## 
## Coefficients:
## (Intercept)       parent  
##     23.8537       0.6474

The output is really simple.

Let’s do a summary() of the regression output with summary(reg1).

# model summary
summary(reg1)
## 
## Call:
## lm(formula = child ~ parent, data = Galton.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7923 -1.3502  0.0604  1.6498  5.9445 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.85366    2.99886   7.954  5.9e-15 ***
## parent       0.64736    0.04389  14.751  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared:  0.2079, Adjusted R-squared:  0.2069 
## F-statistic: 217.6 on 1 and 829 DF,  p-value: < 2.2e-16
# save model summary details
sreg1 <- summary(reg1)

Much more detailed output.

Interpretation: for each 1-inch increase in mid-parent height, the predicted child height increases by about 0.65 inches.

The way this works is that lm() estimates values for \(\alpha\) and \(\beta\) such that, when you plug in your observed independent variable values, you get predicted values of the dependent variable that are close to the observed values. In statistics we quantify this closeness by minimizing the sum of squared residuals, that is

\[\begin{equation} \sum\limits_{i=1}^n (Y_{obs.i} - Y_{pred.i})^2 \end{equation}\]
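For the univariate case this minimization has a closed-form solution. As a minimal sketch (not part of the original analysis), we can compute the least-squares slope and intercept by hand on the training data; they should match the coefficients that lm() reports below.

# a minimal sketch: closed-form least-squares estimates by hand
x <- Galton.train$parent
y <- Galton.train$child
b.hat <- cov(x, y) / var(x)         # slope estimate
a.hat <- mean(y) - b.hat * mean(x)  # intercept estimate
c(intercept = a.hat, slope = b.hat)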

Model evaluation

So, let’s look at how we did.

  • First let’s calculate the observed and predicted values in the training and testing datasets, using the predict() function.
  • Then compute the residuals (how far off the model’s predictions were from the observed values) = observed value - predicted value.
# get predicted values in the training and testing dataset
Galton.train$pred.child <- 
  predict(reg1, newdata = Galton.train)
Galton.test$pred.child <- 
  predict(reg1, newdata=Galton.test)

# calculate residuals in the training and testing dataset
Galton.train$resid <- 
  Galton.train$child - Galton.train$pred.child
Galton.test$resid <- 
  Galton.test$child - Galton.test$pred.child
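
As a quick numeric summary of how well the model generalizes (a minimal sketch, not in the original analysis), we can compare the root-mean-squared error of the residuals in the two datasets; they should be similar if the model is not overfitting.

# compare root-mean-squared error in training vs. testing data
sqrt(mean(Galton.train$resid^2))
sqrt(mean(Galton.test$resid^2))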

Now that we have calculated these values, let’s look at some simple plots. The Companion to Applied Regression (aka car) package has some good functionality for this:

library(car)

#get the residual plots
residualPlots(reg1)

##            Test stat Pr(>|Test stat|)
## parent        1.4592           0.1449
## Tukey test    1.4592           0.1445

Let’s look at how well we do in the testing dataset. First let’s plot the TEST data.

Base R approach

#plot test dataset
plot(child ~ parent, data = Galton.test)

#overlay the regression line
abline(a=23.85, b=0.65)
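
Rather than typing in the rounded coefficients by hand, you can also pass the fitted model object straight to abline(); this is just an alternative to the line above.

# equivalent: let abline() pull the coefficients from the model object
abline(reg1, col = "red")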

ggplot2 approach

library(ggplot2)
ggplot(Galton.test, aes(x=parent, y=child)) +
  geom_point() +
  geom_smooth(method = "lm")

Now let’s look at the residual plots:

Residuals versus “X”

Do the residuals have any pattern with the value of the predictor?

#plot residual versus X
plot(resid ~ parent, data = Galton.test)

#overlay the zero line
abline(0,0)

Residuals versus Predicted Values

Do the residuals show any pattern with the predicted values?

#plot residual v predicted
plot(resid ~ pred.child, data = Galton.test)

Part of the statistical analysis of a linear regression is hypothesis testing. We usually want to test the null hypothesis that the slope is 0, i.e., that there is no relationship between y and x. We express this mathematically as:

\[\begin{equation} H_0: \beta = 0 \end{equation}\]

We test the null hypothesis against an alternative hypothesis that the slope is not equal to 0, i.e., that there is some (positive or negative) relationship between y and x. We express this mathematically as:

\[\begin{equation} H_A: \beta \neq 0 \end{equation}\]
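A closely related way to look at this test (a minimal sketch, using the reg1 object we already fit) is a confidence interval for the slope; if the 95% interval excludes 0, we reject \(H_0\) at the 5% level.

# 95% confidence intervals for the intercept and slope
confint(reg1)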

As you have seen in these data, the y-values do not line up as a perfect linear function of x, i.e., child height does not line up as a perfect linear function of parent height. There is an error that occurs for each observation. So we are really fitting the statistical model

\[\begin{equation} Y = \alpha + \beta x + \epsilon \end{equation}\]

Recall from above that we estimate \(\alpha\) and \(\beta\) to minimize the sum of squared residuals.

To do a statistical test on the slopes we have to make some assumptions. One of the assumptions is that \(\epsilon\) ~ N(0, \(\sigma^2\)), that is, that the errors between the observed and predicted values of Y take on a normal distribution with mean of 0 and variance of \(\sigma^2\).

We can formally test this assumption, but we can also do a qq-plot which will give us a visual representation of the same. We get this plot as follows:

Histogram and Q-Q Plot of residuals

Do the residuals look normally distributed (from the model fit to the training data)?

# base R approach
hist(Galton.train$resid)

qqnorm(Galton.train$resid)
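
We can also formally test the normality assumption mentioned above; one common option (a minimal sketch, not in the original) is the Shapiro-Wilk test from base R.

# formal test of normality of the training residuals
shapiro.test(Galton.train$resid)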

# using car package qqPlot()
# put in the model object
qqPlot(reg1)

##  1 13 
##  1 11

In this last plot (from qqPlot(reg1) using the car package) we have taken each residual and divided by the estimated standard deviation to create studentized residuals. We then rank them and calculate the percentile for each studentized residual, then create this graph. Of course, the car package is doing all of this under the hood, so to speak.
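
Here is a rough base-R approximation of that idea (a sketch only; the car implementation additionally uses t-distribution quantiles and confidence bands):

# studentized residuals by hand via rstudent(), plotted against normal quantiles
stud.res <- rstudent(reg1)
qqnorm(stud.res)
qqline(stud.res)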

To do the significance test, recall the summary of the linear regression model we had above.

## 
## Call:
## lm(formula = child ~ parent, data = Galton.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7923 -1.3502  0.0604  1.6498  5.9445 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.85366    2.99886   7.954  5.9e-15 ***
## parent       0.64736    0.04389  14.751  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.249 on 829 degrees of freedom
## Multiple R-squared:  0.2079, Adjusted R-squared:  0.2069 
## F-statistic: 217.6 on 1 and 829 DF,  p-value: < 2.2e-16

Note that there are two estimated coefficients, so our estimated line is

\[\begin{equation} child = 23.85 + 0.65 \times parent \end{equation}\]

You will also notice that next to each estimate there is a column for the standard error, a column for the t-value, and a column for the p-value. We divide the estimate by the standard error to compute the t-value, which here has 829 degrees of freedom. For the slope estimate, the p-value is < 2e-16, i.e., the slope is highly significantly different from 0.

At the bottom of the summary you will notice a line for the F-statistic, which is the ratio of the mean square for the model to the mean square for the residuals. This has an F-distribution with 1 and 829 degrees of freedom, and takes on the value of 217.6, which has a p-value < 2e-16. Since this is a univariate regression, if you take the value of the t-statistic for the slope and square it, that will give you the value of the F-statistic.
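Here is a minimal sketch verifying that arithmetic from the saved summary object sreg1:

# verify the t-value, p-value, and the t-squared = F relationship
b.slope  <- sreg1$coefficients["parent", "Estimate"]
se.slope <- sreg1$coefficients["parent", "Std. Error"]
t.slope  <- b.slope / se.slope      # t-value (about 14.75)
2 * pt(-abs(t.slope), df = 829)     # two-sided p-value
t.slope^2                           # equals the F-statistic (about 217.6)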

Isn’t math fun?

Understanding Linear Regression: The Multivariate Case

Suppose you have more than one independent variable that you think explains the dependent variable. That is, instead of the simple univariate case of

\[\begin{equation} y = \alpha + \beta x \end{equation}\]

you have the multivariate case of

\[\begin{equation} y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \end{equation}\]

As an example, consider the Prestige dataset (which comes with the car package). It consists of 102 observations on 6 variables, as follows:

education: The average number of years of education for occupational incumbents.
income: The average income of occupational incumbents, in dollars.
women: The percentage of women in the occupation.
prestige: The average prestige rating for the occupation.
census: The code of the occupation used in the survey.
type: Professional and managerial (prof), white collar (wc), blue collar (bc), or missing (NA).

We want to determine how the prestige of an occupation is related to average income, education (average number of years for incumbents), and the percentage of women in the occupation.

Let’s explore first:

#explore the 5 m's for the Prestige dataset
summary(Prestige)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517

Now let’s see some plots:

# produce a scatterplot matrix 
# using the function from the car package
# note: span adjusts the amount of smoothing
car::scatterplotMatrix(~ prestige + income +
                         education + women,
                       span = 0.7,
                       data = Prestige)

Hmmm…the variable income looks kind of wonky, for lack of a better term. So let’s see if a transformation will help:

Let’s use the log2() function to transform income.

NOTE: there are multiple options for the log() function:

  • log() is the “natural” log - base “e” (exp(1) = 2.718282…)
  • log10() is log base 10 (think decibels or the Richter scale)
  • log2() is log base 2 (a quick numeric check of all three is sketched below)
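A quick numeric check of the three bases:

# sanity check the three log functions
log(exp(1))  # natural log of e = 1
log10(100)   # base-10 log of 100 = 2
log2(8)      # base-2 log of 8 = 3
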
car::scatterplotMatrix( ~ prestige + log2(income) +
                          education + women,
                        span = 0.7,
                        data = Prestige)

Well, that looks a little better.

Now let’s see what the regression of prestige on income (logged), education, and women looks like:

# let the regression rip
prestige.mod1 <- lm(prestige ~ education + 
                      log2(income) + women, 
                    data = Prestige)

#and see what we have
summary(prestige.mod1)
## 
## Call:
## lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.364  -4.429  -0.101   4.316  19.179 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -110.9658    14.8429  -7.476 3.27e-11 ***
## education       3.7305     0.3544  10.527  < 2e-16 ***
## log2(income)    9.3147     1.3265   7.022 2.90e-10 ***
## women           0.0469     0.0299   1.568     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.093 on 98 degrees of freedom
## Multiple R-squared:  0.8351, Adjusted R-squared:   0.83 
## F-statistic: 165.4 on 3 and 98 DF,  p-value: < 2.2e-16

OPTIONS for Regression Model Output Using Rmarkdown

Some helpful packages for formatting regression models using Rmarkdown:

sjPlot package

Good News: This works very well knitting to HTML - once created you can cut and paste from HTML into MS WORD DOC format.

Bad News: This does NOT knit directly to DOC or PDF.

library(sjPlot)
sjPlot::tab_model(prestige.mod1)
                  prestige
Predictors        Estimates    CI                 p
(Intercept)       -110.97      -140.42 – -81.51   <0.001
education         3.73         3.03 – 4.43        <0.001
income [log2]     9.31         6.68 – 11.95       <0.001
women             0.05         -0.01 – 0.11       0.120
Observations      102
R2 / R2 adjusted  0.835 / 0.830

apaTables package

Bad News: This creates a file OUTSIDE of the Rmarkdown file - the file is saved in your current R project directory (given the code below).

Good News: The cool thing is the output file is ready to go in MS WORD (*.DOC) format!!

library(apaTables)
apa.reg.table(prestige.mod1, 
              filename = "Table2_APA.doc", 
              table.number = 2)
## 
## 
## Table 2 
## 
## Regression results using prestige as the criterion
##  
## 
##     Predictor         b          b_95%_CI beta   beta_95%_CI sr2  sr2_95%_CI
##   (Intercept) -110.97** [-140.42, -81.51]                                   
##     education    3.73**      [3.03, 4.43] 0.59  [0.48, 0.70] .19  [.10, .27]
##  log2(income)    9.31**     [6.68, 11.95] 0.46  [0.33, 0.59] .08  [.03, .13]
##         women      0.05     [-0.01, 0.11] 0.09 [-0.02, 0.20] .00 [-.01, .01]
##                                                                             
##                                                                             
##                                                                             
##      r             Fit
##                       
##  .85**                
##  .74**                
##   -.12                
##            R2 = .835**
##        95% CI[.77,.87]
##                       
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

Sneak Peek - image of the Word DOC table created:

[Image: Table 2 APA DOC file]

Rmarkdown friendly output with gtsummary package

This is probably the MOST flexible since you can KNIT directly to HTML, DOC or PDF. See more details at https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html.

library(gtsummary)
tbl_mod1 <- tbl_regression(prestige.mod1)
tbl_mod1
Characteristic   Beta   95% CI        p-value
education        3.7    3.0, 4.4      <0.001
log2(income)     9.3    6.7, 12       <0.001
women            0.05   -0.01, 0.11   0.12
CI = Confidence Interval

From this output we see that if education increases by 1 year, then average prestige rating will increase by 3.73 units, with income and women held constant.
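
To see that interpretation in action, here is a minimal sketch using predict() with two hypothetical occupations that differ only by one year of education (the income and women values are made up, roughly the sample means):

# hypothetical data: identical except for one extra year of education
newdat <- data.frame(education = c(10, 11),
                     income    = c(6800, 6800),
                     women     = c(29, 29))
diff(predict(prestige.mod1, newdata = newdat))  # about 3.73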

For education we note that the p-value is < 2e-16.

So we reject the null hypothesis \(H_0: \beta_1 = 0\).

How do we interpret the contributions of log2(income) and women?

Another point to note: the multiple R-squared is 0.835, indicating that about 84% of the variability in average prestige rating is explained by these three independent variables.

Finally, the F-statistic of 165.4 is for testing the null hypothesis H_0: \(\beta_1 = \beta_2 = \beta_3 = 0\). It is highly significant. (and that “squaring the t to get the F” trick only works for the univariate regression case…)

Since the p-value for women does not reach the traditional significance level of 0.05, we might consider removing it from our model. Let’s see what happens then…

#run the model with only two predictors
prestige.mod2 <- lm(prestige ~ education + 
                      log2(income), data= Prestige)

# look at the results
summary(prestige.mod2)
## 
## Call:
## lm(formula = prestige ~ education + log2(income), data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0346  -4.5657  -0.1857   4.0577  18.1270 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -95.1940    10.9979  -8.656 9.27e-14 ***
## education      4.0020     0.3115  12.846  < 2e-16 ***
## log2(income)   7.9278     0.9961   7.959 2.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.145 on 99 degrees of freedom
## Multiple R-squared:  0.831,  Adjusted R-squared:  0.8275 
## F-statistic: 243.3 on 2 and 99 DF,  p-value: < 2.2e-16

Rmarkdown friendly output with gtsummary package

tbl_mod2 <- tbl_regression(prestige.mod2)
tbl_mod2
Characteristic   Beta   95% CI      p-value
education        4.0    3.4, 4.6    <0.001
log2(income)     7.9    6.0, 9.9    <0.001
CI = Confidence Interval

Hmmmm… education and log2(income) remain highly significant, there is little reduction in R-squared, and the model overall is still significant.

stargazer package

We can compare the results of these two analyses a little more cleanly using the stargazer package as follows:

NOTE: stargazer works well if you are “knitting” to either HTML or PDF. This will NOT work if you are “knitting” to WORD (DOCX) formats.

The code below is for HTML output. Change the type argument if you want PDF output.

# compare the results of the two regression models
stargazer(prestige.mod1, prestige.mod2, 
          title="Comparison of 2 Regression outputs",
          type = "html", align=TRUE)
Comparison of 2 Regression outputs
                        Dependent variable: prestige
                        (1)                       (2)
education               3.731***                  4.002***
                        (0.354)                   (0.312)
log2(income)            9.315***                  7.928***
                        (1.327)                   (0.996)
women                   0.047
                        (0.030)
Constant                -110.966***               -95.194***
                        (14.843)                  (10.998)
Observations            102                       102
R2                      0.835                     0.831
Adjusted R2             0.830                     0.828
Residual Std. Error     7.093 (df = 98)           7.145 (df = 99)
F Statistic             165.428*** (df = 3; 98)   243.323*** (df = 2; 99)
Note: *p<0.1; **p<0.05; ***p<0.01

gtsummary merged tables

This is a little friendlier to different knitted outputs but does not include the model comparison statistics.

tbl_mod12 <-
  tbl_merge(
    tbls = list(tbl_mod1, tbl_mod2), 
    tab_spanner = c("**Model 1**", "**Model 2**") 
  )
tbl_mod12
                 Model 1                          Model 2
Characteristic   Beta   95% CI        p-value     Beta   95% CI      p-value
education        3.7    3.0, 4.4      <0.001      4.0    3.4, 4.6    <0.001
log2(income)     9.3    6.7, 12       <0.001      7.9    6.0, 9.9    <0.001
women            0.05   -0.01, 0.11   0.12
CI = Confidence Interval

We conclude that we lose little by eliminating a predictor variable. This is also consistent with a principle of parsimony, also known as the KISS principle, or “Keep It Simple, Silly”.
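
One standard way to make that comparison formal (not shown in the original text) is a partial F-test on the nested models:

# partial F-test: does adding women significantly improve the fit?
anova(prestige.mod2, prestige.mod1)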

Let’s finish with some diagnostics. First, the plots for the first model with 3 independent variables:

# diagnostics for the first model 
# with 3 independent variables
# function from car package
residualPlots(prestige.mod1)

##              Test stat Pr(>|Test stat|)  
## education       1.7555          0.08233 .
## log2(income)    0.5851          0.55981  
## women           2.2770          0.02499 *
## Tukey test      0.7630          0.44545  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the curvature test for the variable women is significant (p = 0.025), i.e., there is a non-zero trend in its residual plot.

And now the plots for the second model with 2 independent variables

# diagnostics for the second model 
# with 2 independent variables
residualPlots(prestige.mod2)

##              Test stat Pr(>|Test stat|)
## education       1.6153           0.1095
## log2(income)    0.2213           0.8253
## Tukey test      0.6530           0.5137

Now the plots look really good, much better.

Another diagnostic tool is the added-variable plot, which shows the additional contribution of each variable given that all of the others are already in the model. In these plots we can also identify the most influential observations.

#added variable plots - car function
avPlots(prestige.mod1, id.n=2, id.cex=0.7)

#id.n - identify the n most influential observations
#id.cex - controls the size of the point labels

Let’s run the qq-plot for model 1:

# run the qq-plot
qqPlot(prestige.mod1, id.n=3)

## collectors    farmers 
##         46         67
# here, id.n identifies the n observations with the largest residuals in absolute value

Are there any outliers?

We can also run a Bonferroni test for outliers.

#run Bonferroni test for outliers
outlierTest(prestige.mod1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##         rstudent unadjusted p-value Bonferroni p
## farmers 2.857462          0.0052259      0.53305

Are there any points that are of high influence?

#identify highly influential points
influenceIndexPlot(prestige.mod1, id.n=3)

NB: If there are points that are a) outliers AND b) highly influential, these have the potential to change the inference. You should consider removing them.

How do we make heads or tails out of the plots above? One way is with an influence plot.

#make influence plot
influencePlot(prestige.mod1, id.n=3)

##                StudRes        Hat      CookD
## ministers    2.4211475 0.10072159 0.15638036
## collectors  -2.5320314 0.01352116 0.02081914
## newsboys    -0.3534964 0.28595843 0.01262364
## babysitters  1.7227767 0.19703063 0.17848347
## farmers      2.8574618 0.03896096 0.07711591

Another diagnostic is to test for heteroskedasticity, i.e., whether the variance of the error term is non-constant.

#test for heteroskedasticity
ncvTest(prestige.mod1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.02841389, Df = 1, p = 0.86614

We also want to check for multicollinearity, that is, whether some of our independent variables are highly correlated with each other. We do this by looking at the Variance Inflation Factor (VIF). A VIF > 4 suggests collinearity.

vif(prestige.mod1)
##    education log2(income)        women 
##     1.877097     2.572283     1.806431

olsrr package

Another helpful R package for linear regression models is the olsrr package.

We can use the ols_vif_tol() function to also compute the VIFs and tolerances (TOL) for each predictor in the model. TOL is the inverse of the VIF.

library(olsrr)
ols_vif_tol(prestige.mod1)
##      Variables Tolerance      VIF
## 1    education 0.5327376 1.877097
## 2 log2(income) 0.3887597 2.572283
## 3        women 0.5535778 1.806431
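
Since tolerance is just the reciprocal of the VIF, a quick check against the car output:

# tolerance = 1 / VIF
1 / vif(prestige.mod1)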

Another good measure of multicollinearity is the condition index. You want the condition index < 30. When you look at the output below, only look at the LAST LINE of the output.

ols_eigen_cindex(prestige.mod1)
##     Eigenvalue Condition Index    intercept   education log2(income)
## 1 3.4928045271        1.000000 0.0001776490 0.002498313 0.0001405901
## 2 0.4705124616        2.724593 0.0002315624 0.003495553 0.0002925430
## 3 0.0356933217        9.892218 0.0149794957 0.595229304 0.0048739025
## 4 0.0009896896       59.407002 0.9846112929 0.398776831 0.9946929645
##          women
## 1 0.0150905237
## 2 0.5254459989
## 3 0.0009712806
## 4 0.4584921968

The condition index for Model 1 was 59, which is >> 30. When we removed women in Model 2, was the condition index lower? Also look at the VIFs for Model 2.

ols_vif_tol(prestige.mod2)
##      Variables Tolerance      VIF
## 1    education 0.6995808 1.429427
## 2 log2(income) 0.6995808 1.429427
ols_eigen_cindex(prestige.mod2)
##   Eigenvalue Condition Index    intercept   education log2(income)
## 1 2.96242154        1.000000 0.0004671053 0.004676692 0.0003647341
## 2 0.03575405        9.102502 0.0274171118 0.777086884 0.0091862089
## 3 0.00182441       40.296029 0.9721157829 0.218236424 0.9904490570

It is lower, but the condition index is still > 30, while the VIFs are all < 2. This is probably due to education and income being correlated with each other - they are not independent of each other…