Let’s load the Penguins data from the palmerpenguins
package, along with the dplyr and
ggplot2 packages. Other packages are loaded in
the R chunks below as needed.
R packages used for this example on linear and logistic regression: palmerpenguins, dplyr, ggplot2, easystats, gtsummary, car, broom, olsrr, ROCR, DescTools.
# load packages needed
library(dplyr)
library(palmerpenguins)
library(ggplot2)
For this homework ASSIGNMENT, let’s pull out the data for the
Chinstrap penguins, and we’ll remove the
NAs here at the beginning for the penguins with missing
body size measurements.
pg1 <- penguins %>%
select(species, body_mass_g, bill_depth_mm,
bill_length_mm, flipper_length_mm) %>%
filter(species == "Chinstrap") %>%
filter(complete.cases(.))
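As a quick sanity check (a sketch, not required by the assignment), we can confirm how many complete Chinstrap cases remain after filtering:

```r
# number of Chinstrap penguins with complete body size measurements
# (the count should match the N reported in the summary tables below)
dim(pg1)
nrow(pg1)
```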
Before we build the regression model, let’s take a moment and look at the bivariate correlations between body mass and the body dimensions for bill depth and bill and flipper length. We’ll put body mass in the 1st column since this is the variable we’ll focus on for the linear regression.
library(easystats)
pg1 %>%
select(body_mass_g, bill_depth_mm,
bill_length_mm, flipper_length_mm) %>%
correlation() %>%
summary(redundant = TRUE) %>%
display()
| Parameter | body_mass_g | bill_depth_mm | bill_length_mm | flipper_length_mm |
|---|---|---|---|---|
| body_mass_g | | 0.60*** | 0.51*** | 0.64*** |
| bill_depth_mm | 0.60*** | | 0.65*** | 0.58*** |
| bill_length_mm | 0.51*** | 0.65*** | | 0.47*** |
| flipper_length_mm | 0.64*** | 0.58*** | 0.47*** | |
p-value adjustment method: Holm (1979)
=========================================
HOMEWORK QUESTION #1: What do you notice about the correlations of bill depth, bill length and flipper length with body mass? Are these smaller or larger here for the Chinstrap penguins than what we saw in the EXAMPLE template report for the Gentoo penguins?
ANSWER TO QUESTION 1 HERE:
=========================================
Fit a linear model to predict body_mass_g from the other 3 dimensional measurements.
NOTE: Since we are focusing in this example on model interpretation, let’s fit the model to all available data. For now, we will not split the data into training and testing datasets.
Let’s fit the lm linear model.
lm1 <- lm(body_mass_g ~ bill_depth_mm +
bill_length_mm + flipper_length_mm,
data = pg1)
To view and format the model results in a nice table, let’s
use the gtsummary package and the
tbl_regression() function. I have also added some other
formatting features from the package. To learn more, see https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html.
NOTE: the add_glance_source_note() function uses the
broom::glance() function from the broom
package, which is loaded with gtsummary. See https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html#gtsummary-functions-to-add-information.
library(gtsummary)
lm1 %>%
tbl_regression(
estimate_fun = ~ style_number(.x, digits = 2)
) %>%
add_glance_source_note(
include = c(r.squared, adj.r.squared)
)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| bill_depth_mm | 91.51 | 5.37, 177.65 | 0.038 |
| bill_length_mm | 16.04 | -11.01, 43.09 | 0.2 |
| flipper_length_mm | 22.58 | 10.81, 34.35 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.504; Adjusted R² = 0.481 | |||
=========================================
HOMEWORK QUESTION #2: What do you notice about the fit of this model (R2 and Adjusted R2) for the Chinstrap penguins body mass model compared to what we saw in the EXAMPLE template report for the Gentoo penguins?
ANSWER TO QUESTION 2 HERE:
=========================================
One of the assumptions of linear regression is that the residuals should be normally distributed. So, let’s make a histogram of the residuals to look for non-normal tendencies like skewness, and let’s also look for non-linear trends or outliers with the normal probability (QQ) plot.
Look at the QQ plot - do you notice any potential outliers? If so, what are their case numbers?
# histogram of the residuals
hist(lm1$residuals)
# get QQ plot of residuals
# the normal probability plot
library(car)
car::qqPlot(lm1)
## [1] 20 39
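As an optional numeric complement to the visual checks (not part of the original assignment), a Shapiro-Wilk test of the residuals can be run; a small p-value would suggest non-normality:

```r
# Shapiro-Wilk test of normality on the model residuals
# (a supplement to, not a replacement for, the histogram and QQ plot)
shapiro.test(lm1$residuals)
```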
=========================================
HOMEWORK QUESTION #3: From the Histogram and QQ plot, do the residuals look normally distributed and do you notice any outliers - be sure to look at both low and high values? If so, which cases?
ANSWER TO QUESTION 3 HERE:
=========================================
In the car package, we can also use the
residualPlots() function to look for any non-linear trends
in the plots of the residuals against each predictor. We are looking for
anything that deviates from a horizontal line through Y = 0. Look for
any lines that trend up or down over the range of the predictor, and
look for any curvature in the lines.
After the plots, a table is produced showing the Tukey’s test for nonadditivity. If the test is significant (p<.05) for any given predictor, then that predictor may not meet the assumption of a linear relationship to the outcome variable - in other words, you may need to make this predictor variable quadratic or higher order like \(X^2\) or \(X^3\).
Review the reported test statistics and look at the plots - do any variables have significant p-value or show significant curvature in the plot?
Learn more in the BOOK: “Companion for Applied Regression” by John
Fox (who also wrote the car package for R), see Chapter 6
https://us.sagepub.com/sites/default/files/upm-binaries/38503_Chapter6.pdf.
# diagnostics for the first model
# with 3 independent variables
# function from car package
car::residualPlots(lm1)
## Test stat Pr(>|Test stat|)
## bill_depth_mm -0.2092 0.8349
## bill_length_mm 0.6131 0.5420
## flipper_length_mm 1.4469 0.1529
## Tukey test 1.5812 0.1138
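None of the curvature tests are significant here, but if one had been, a quadratic term could be added with I() inside the model formula. A hypothetical sketch (not needed for this model; lm1_quad is an illustrative name):

```r
# hypothetical: add a squared term for flipper length
# (only warranted if its curvature test had been significant)
lm1_quad <- lm(body_mass_g ~ bill_depth_mm + bill_length_mm +
                 flipper_length_mm + I(flipper_length_mm^2),
               data = pg1)
summary(lm1_quad)
```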
=========================================
HOMEWORK QUESTION #4: From these residual plots, do any of the variables show significant curvature? Remember to look at the table of statistics for each variable for the curvature tests.
ANSWER TO QUESTION 4 HERE:
=========================================
Use the broom package to review other model output
Next we can use the broom package https://broom.tidymodels.org/,
which is useful for looking at model effects, predictions, residuals,
and other related statistics and fit metrics.
We’re going to use the augment() function to pull out
the .fitted variable for the predicted values from the
model. Then we’re going to compare the predicted values to the original
values.
library(broom)
alm1 <- augment(lm1)
Before we make the plot, let’s also look at the summary statistics
(mean (SD) and the min and max) and compare the original values and the
model .fitted values.
What do you notice?
alm1 %>%
select(body_mass_g, .fitted) %>%
tbl_summary(statistic = list(
all_continuous() ~ "{mean} ({sd}); [{min} - {max}]"
),
digits = all_continuous() ~ 2,) %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD) [min - max]"
)
| Characteristic | N = 68¹ |
|---|---|
| body_mass_g | 3,733.09 (384.34); [2,700.00 - 4,800.00] |
| .fitted | 3,733.09 (272.80); [3,192.64 - 4,312.57] |
| 1 Mean (SD) [min - max] | |
=========================================
HOMEWORK QUESTION #5: From these descriptive statistics comparing the original body mass values and the predicted values from the model, how well do the values line up?
ANSWER TO QUESTION 5 HERE:
=========================================
Note: To see how well the values match up, be sure to plot the x and y axes with the same limits and add a Y = X reference line. So, what do you see - how good are the predictions?
# find min and max of original body_mass_g
# and .fitted values
lower <- min(alm1$body_mass_g, alm1$.fitted)
upper <- max(alm1$body_mass_g, alm1$.fitted)
ggplot(alm1,
aes(x = body_mass_g,
y = .fitted)) +
geom_point() +
geom_abline(intercept = 0, slope = 1,
color = "red") +
xlim(lower, upper) +
ylim(lower, upper) +
xlab("Original Body Mass (g)") +
ylab("Model Predicted Body Mass (g)") +
ggtitle("Regression Diagnostic Plot - Fitted vs Original")
=========================================
HOMEWORK QUESTION #6: From the plot above, how well do you think the model is doing in predicting body mass for the Chinstrap penguins?
ANSWER TO QUESTION 6 HERE:
=========================================
Another helpful set of diagnostic plots are looking at the added variable plots. Each of these plots looks at how well the given variable is correlated with (predicts) the outcome after adjusting for the other variables already included in the model.
Look at the plots and see if any show a horizontal line indicating no “added information” for that variable. Also see if any potential outliers are highlighted.
car::avPlots(lm1)
=========================================
HOMEWORK QUESTION #7: Based on the added variable plots above, do you think any predictor variables should be removed from the model?
ANSWER TO QUESTION 7 HERE:
=========================================
If we want to identify outliers in the dataset, we can run the
car::outlierTest() function to find sample cases in the
dataset that might be an unusual case which may be affecting the model
fit.
From the output below, look for any adjusted Bonferroni p-values < 0.05.
#run Bonferroni test for outliers
car::outlierTest(lm1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 39 -2.954897 0.0043947 0.29884
=========================================
HOMEWORK QUESTION #8: Based on the outlier test above, are any cases identified as a possible outlier? (look at the Bonferroni p-values)
ANSWER TO QUESTION 8 HERE:
=========================================
We can also look at other plots for cases that may be outliers or may have extra influence on the final fit of the model. The following “Influence Index Plots” show, for each case: Cook’s distances, Studentized residuals, Bonferroni p-values for the outlier test, and the hat-values (leverage).
See Section 6.3 for “Unusual Data” in Chapter 6 https://us.sagepub.com/sites/default/files/upm-binaries/38503_Chapter6.pdf of the Companion for Applied Regression BOOK.
#identify highly influential points
influenceIndexPlot(lm1)
Another kind of influence plot looks at the points that may both be outliers and also have extra influence on the fit of the model (looking at Studentized Residuals against the Hat-values).
To learn more, see Figure 6.6 in https://us.sagepub.com/sites/default/files/upm-binaries/38503_Chapter6.pdf.
#make influence plot
influencePlot(lm1)
## StudRes Hat CookD
## 7 -0.06438301 0.14682762 0.0001811609
## 18 0.97488266 0.38765159 0.1505304688
## 20 2.66675846 0.01676663 0.0276749399
## 39 -2.95489702 0.06114175 0.1268332456
=========================================
HOMEWORK QUESTION #9: Based on the 2 plots above, do any cases show up as possible outliers and/or as having extra influence or leverage on the fit of the model?
ANSWER TO QUESTION 9 HERE:
=========================================
Another assumption of linear regression is that the variance in the outcome is equal across all levels of the predictors. One way to check this is to run the test for heteroskedasticity.
We are looking for a non-significant p-value so we cannot reject the assumption of equal variance. Was the test significant or not?
# test for heteroskedasticity
ncvTest(lm1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0077973, Df = 1, p = 0.92964
=========================================
HOMEWORK QUESTION #10: Based on the test for heteroskedasticity above, is this assumption met?
ANSWER TO QUESTION 10 HERE:
=========================================
Another assumption for linear regression (really any regression model) is that the predictors are independent of one another. To check this, we can compute the VIF (variance inflation factors) for the predictors. Typically we want the VIFs < 10 and ideally want them < 5. I actually start to get worried about non-independence if the VIFs get much larger than 2.
# we want these values < 10 or < 5
vif(lm1)
## bill_depth_mm bill_length_mm flipper_length_mm
## 2.092955 1.785702 1.541994
OPTIONAL - Also use olsrr package
Even after reviewing the VIFs, it is also a good idea to check that the Condition Index (see the 4th row of the eigenvalue table below) is < 30. Was the condition index < 30 or not?
library(olsrr)
# get VIFs and Tolerances
ols_vif_tol(lm1)
## Variables Tolerance VIF
## 1 bill_depth_mm 0.4777934 2.092955
## 2 bill_length_mm 0.5600039 1.785702
## 3 flipper_length_mm 0.6485110 1.541994
# Get condition index
ols_eigen_cindex(lm1)
## Eigenvalue Condition Index intercept bill_depth_mm bill_length_mm
## 1 3.9953496064 1.00000 0.0000812136 0.0001115001 1.606237e-04
## 2 0.0027102665 38.39471 0.1726656078 0.0573426801 4.071088e-01
## 3 0.0013885690 53.64060 0.0588867582 0.7898132114 5.926962e-01
## 4 0.0005515581 85.11023 0.7683664204 0.1527326084 3.435365e-05
## flipper_length_mm
## 1 0.0000529897
## 2 0.0382929381
## 3 0.0006395062
## 4 0.9610145660
=========================================
HOMEWORK QUESTION #11: Based on the VIFs and condition index above, do any variables show up as possibly non-independent, i.e. is there significant multicollinearity?
ANSWER TO QUESTION 11 HERE:
=========================================
Let’s load the Penguins data. For this homework ASSIGNMENT, let’s pull out the data for the Adelie penguins.
Let’s use logistic regression to see how well using these same 3 dimensional measurements (bill depth, bill and flipper length measurements) differentiates the male and female penguins.
NOTE: We need to do a filter here since there are penguins missing
biological sex. So, let’s re-select the data and remove all NAs
including those for sex to have a complete dataset before
we fit the models. It is important to remove the NAs
now, since they will cause errors for some of the later prediction
functions.
pg2 <- penguins %>%
select(species, sex, bill_depth_mm,
bill_length_mm, flipper_length_mm) %>%
filter(species == "Adelie") %>%
filter(complete.cases(.))
Similar to performing bivariate analyses of each potential predictor with the outcome for linear regression, it is useful when evaluating potential predictors for a logistic regression model to run t-tests for continuous predictors (or chi-square tests for any categorical predictors) across the 2 categories of the logistic regression outcome (male vs female biological sex for the penguins in this example).
I added custom output below and ran a two-sample t-test using the equal variance assumption for the 2-group tests below. Notice that all of the tests are significant, with the males having much larger dimensions than the females.
library(gtsummary)
pg2 %>%
tbl_summary(
by = sex,
include = c(bill_depth_mm,
bill_length_mm,
flipper_length_mm),
type = all_continuous() ~ "continuous2",
statistic = list(
all_continuous() ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})",
"[{min} - {max}]"
)
),
digits = all_continuous() ~ 2
) %>%
add_p(
test = everything() ~ "t.test",
test.args = all_tests("t.test") ~ list(var.equal = TRUE)
)
| Characteristic | female, N = 73 | male, N = 73 | p-value¹ |
|---|---|---|---|
| bill_depth_mm | <0.001 | ||
| Mean (SD) | 17.62 (0.94) | 19.07 (1.02) | |
| Median (Q1, Q3) | 17.60 (17.00, 18.30) | 18.90 (18.50, 19.60) | |
| [Min - Max] | [15.50 - 20.70] | [17.00 - 21.50] | |
| bill_length_mm | <0.001 | ||
| Mean (SD) | 37.26 (2.03) | 40.39 (2.28) | |
| Median (Q1, Q3) | 37.00 (35.90, 38.80) | 40.60 (39.00, 41.50) | |
| [Min - Max] | [32.10 - 42.20] | [34.60 - 46.00] | |
| flipper_length_mm | <0.001 | ||
| Mean (SD) | 187.79 (5.60) | 192.41 (6.60) | |
| Median (Q1, Q3) | 188.00 (185.00, 191.00) | 193.00 (189.00, 197.00) | |
| [Min - Max] | [172.00 - 202.00] | [178.00 - 210.00] | |
| 1 Two Sample t-test | |||
====================================
QUESTION 12: What do you notice about these 3 dimensional measurements relative to the differences between the Males and Female Adelie penguins?
ANSWER FOR QUESTION 12 HERE:
====================================
We will fit the logistic regression model using the
glm() function. This is pretty much the same syntax as we
used above for lm() but we need to remember to set
family = binomial otherwise the default is
family = gaussian which is essentially running a linear
(normal) regression model.
NOTE: sex here is a factor variable where 1 = female and
2 = male, in that order, so the model is predicting males,
using females as the reference category.
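We can confirm the factor coding before fitting; the first level of a factor is treated as the reference category in R model formulas:

```r
# confirm which level is the reference category for the sex factor
levels(pg2$sex)        # first level listed is the reference ("female")
contrasts(pg2$sex)     # the dummy column coded 1 is what the model predicts
```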
glm1 <- glm(sex ~ bill_depth_mm +
bill_length_mm + flipper_length_mm,
family = binomial,
data = pg2)
We can again use the tbl_regression() function from the
gtsummary package to get a nice table for the model. It is
important to remember to set
exponentiate = TRUE to get the odds ratios instead of the
original raw betas.
You’ll also notice that I added add_glance_source_note()
which adds some of the model fit statistics such as the AIC, BIC and
other deviance statistics and sample size.
gtsummary::tbl_regression(glm1,
exponentiate = TRUE) %>%
add_glance_source_note()
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| bill_depth_mm | 5.90 | 3.11, 13.0 | <0.001 |
| bill_length_mm | 2.17 | 1.63, 3.10 | <0.001 |
| flipper_length_mm | 1.04 | 0.96, 1.14 | 0.4 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
| Null deviance = 202; Null df = 145; Log-likelihood = -45.5; AIC = 98.9; BIC = 111; Deviance = 90.9; Residual df = 142; No. Obs. = 146 | |||
====================================
QUESTION 13: Based on the model table above, which variable has the highest odds ratio? and what was the final sample size used to fit the model? Are any of the variables not significant in the model?
ANSWER FOR QUESTION 13 HERE:
====================================
In linear regression we computed the “fitted” (or “predicted” values) from the model and compared them back to the original values to get an idea of how well the model did in predicting the original data.
We can do the same thing with a logistic regression model. However, it is important to remember that the “original” outcome data for logistic regression is binary and is often coded using a number like 0 versus 1, but could also be a category like “apple” versus “oranges”. So, the logistic regression model actually predicts the probability for each outcome category, i.e., the probability for predicting a “1” or probability of predicting a “0”.
Let’s look at the predicted probabilities from this model. We can use
the broom::augment() function, but we need to add
type.predict = "response" to get the probabilities in the
.fitted variable.
Learn more at https://broom.tidymodels.org/reference/augment.glm.html.
library(broom)
aglm1 <- augment(glm1,
type.predict="response")
Look at the top of this augmented output dataset.
head(aglm1)
## # A tibble: 6 × 10
## sex bill_depth_mm bill_length_mm flipper_length_mm .fitted .resid .hat
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 male 18.7 39.1 181 0.631 0.960 0.0548
## 2 female 17.4 39.5 186 0.222 -0.708 0.0305
## 3 female 18 40.3 195 0.690 -1.53 0.0399
## 4 female 19.3 36.7 193 0.557 -1.28 0.0599
## 5 male 20.6 39.3 190 0.988 0.154 0.00928
## 6 female 17.8 38.9 181 0.228 -0.720 0.0385
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Notice that the 1st penguin is a male and the predicted probability is 0.6306458, which is above 0.5.
The second penguin is a female and their probability was 0.2215024, which is well below 0.5.
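The .fitted probabilities are just the inverse logit (plogis() in R) of the model’s linear predictor; a quick sketch to verify this for our model:

```r
# .fitted = exp(eta) / (1 + exp(eta)), where eta is the linear predictor
eta <- predict(glm1)                           # log-odds scale
all.equal(unname(plogis(eta)), aglm1$.fitted)  # should be TRUE
```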
Let’s use a simple cutscore of 0.5 and classify all penguins with a
probability (.fitted) > 0.5 as “males” and those with
probabilities (.fitted) <= 0.5 to be “females”.
aglm1 <- aglm1 %>%
mutate(predictedsex = ifelse(
.fitted > 0.5, "predicted male",
"predicted female"
))
Let’s look at the table of original sex (in the columns) compared to the sex predicted by the model (in the rows).
This table is the “confusion matrix”.
# make the confusion matrix
aglm1 %>%
with(table(predictedsex, sex))
## sex
## predictedsex female male
## predicted female 64 9
## predicted male 9 64
====================================
QUESTION 14: Using the output from the table above, compute the following: overall correct classification rate; False Negatives (Predicted females that were actually males) percentage; and False Positives (Predicted males that were actually females) percentage
ANSWER FOR QUESTION 14 HERE:
====================================
HINT: Use this website for computing “contingency tables” https://statpages.info/ctab2x2.html. Treat the “positive or present” category as the one the logistic regression model is predicting - in this case the “males” are the “present” or “positive” category.
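If you prefer to check your hand calculations in R, here is a sketch computing the rates directly from a 2x2 confusion matrix like the one above (treating "male" as the positive category, as the HINT suggests):

```r
# rebuild the 2x2 confusion matrix from the printed table
cm <- matrix(c(64, 9, 9, 64), nrow = 2,
             dimnames = list(predicted = c("female", "male"),
                             actual    = c("female", "male")))
# overall correct classification rate = correct cells / total
sum(diag(cm)) / sum(cm)
# false negative rate = actual males predicted female / all actual males
cm["female", "male"] / sum(cm[, "male"])
# false positive rate = actual females predicted male / all actual females
cm["male", "female"] / sum(cm[, "female"])
```

Multiply each proportion by 100 to report it as a percentage.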
One of the best model fit statistics to report for a logistic regression is the AUC (area under the curve) for the ROC (receiver operating characteristic) curve, which summarizes the classification trade-off between false positives and false negatives across all possible cutscores - this is similar to R2 for linear regression.
Ideally we want the AUC > 0.7 at a minimum, and AUC > 0.8 is good and AUC > 0.9 is very good.
# Also compute the AUC and
# plot an ROC
library(ROCR)
p <- predict(glm1, newdata=pg2,
type="response")
pr <- ROCR::prediction(p, as.numeric(pg2$sex))
prf <- ROCR::performance(pr, measure = "tpr",
x.measure = "fpr")
#plot(prf)
#abline(0, 1, col = "red")
# compute AUC, area under the curve
# also called the C-statistic
auc <- ROCR::performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
#auc
# ideally you want AUC as close to 1 as possible
# AUC > 0.9 is great
# AUC > 0.8 is good
# AUC > 0.7 is ok
# AUC < 0.7 are not very good
# AUC around 0.5 is no better than flipping a fair coin
# which is a useless model
# also - add title to plot with AUC in title
plot(prf,
main = paste("ROC Curve, AUC = ", round(auc, 3)))
abline(0, 1, col="red")
====================================
QUESTION 15: What is the AUC for this model? Is this a good model or not?
ANSWER FOR QUESTION 15 HERE:
====================================
Let’s also take a look at the pseudo-R2 values from a logistic
regression using the DescTools package and its
PseudoR2() function.
Learn more at https://andrisignorell.github.io/DescTools/reference/PseudoR2.html.
From help("PseudoR2", package = "DescTools"):
Although there’s no commonly accepted agreement on how to assess the fit of a logistic regression, there are some approaches. The goodness of fit of the logistic regression model can be expressed by some variants of pseudo R squared statistics, most of which being based on the deviance of the model.
Details:
Cox and Snell’s R2 is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a “perfect” model.
Nagelkerke’s R2 (also sometimes called Cragg-Uhler) is an adjusted version of the Cox and Snell’s R2 that adjusts the scale of the statistic to cover the full range from 0 to 1.
McFadden’s R2 is another version, based on the log-likelihood kernels for the intercept-only model and the full estimated model.
Veall and Zimmermann concluded that from a set of six widely used measures the measure suggested by McKelvey and Zavoina had the closest correspondence to ordinary least square R2. The Aldrich-Nelson pseudo-R2 with the Veall-Zimmermann correction is the best approximation of the McKelvey-Zavoina pseudo-R2. Efron, Aldrich-Nelson, McFadden and Nagelkerke approaches severely underestimate the “true R2”.
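For intuition, McFadden’s R2 can be computed by hand from the log-likelihoods of the fitted model and an intercept-only model; a sketch:

```r
# McFadden's R2 = 1 - logLik(full model) / logLik(intercept-only model)
ll_full <- as.numeric(logLik(glm1))
ll_null <- as.numeric(logLik(update(glm1, . ~ 1)))  # intercept-only refit
1 - ll_full / ll_null
```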
library(DescTools)
PseudoR2(glm1,
which = "all")
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.5507694 0.5112435 0.5339808 0.7119744 0.4329550
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.7452660 0.6237201 0.7995431 0.6167949 98.9238123
## BIC logLik logLik0 G2
## 110.8582388 -45.4619062 -101.1994884 111.4751644
====================================
QUESTION 16: Given the pseudo-R2 values, approximately how much variance in the outcome is explained by the model?
ANSWER FOR QUESTION 16 HERE:
====================================