Lesson - logistic regression

H.E.L.P. Dataset

For this exercise, we’ll be working with the HELP (Health Evaluation and Linkage to Primary Care) dataset. You can learn more about this dataset at https://nhorton.people.amherst.edu/sasr2/datasets.php. This dataset is also used by Ken Kleinman and Nicholas J. Horton for their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (which is another helpful textbook). Also see https://melindahiggins2000.github.io/N736/helpdata.html.

The dataset is ready for you, just load "help_set1_wide.RData".

library(tidyverse)
load("help_set1_wide.RData")
helpdata <- helpdat.set1

HELP Variables of Focus For This Exercise

Let’s focus on homeless as the main outcome variable which is dichotomous coded as 0 and 1. We’ll use logistic regression to look at predicting whether someone was homeless or not using these variables:

female: female=1, male=0;
pss_fr: Perceived Social Support from Friends
pcs: Physical Component Score (SF-36 quality of life)
indtot: Inventory of Drug Use

Select these variables and save in a smaller dataset, h1.

h1 <- helpdata %>%
  dplyr::select(homeless, female, pcs, pss_fr, indtot)

Check variable properties

NOTE: These data were originally read in from an SPSS formatted file using the haven package. The advantage of this is that the labels of the variables did come over but the “factor levels” are not included. For example, if we look at a table of h1$homeless we will only see 0 and 1 and not the character levels of “yes” and “no”. If you want these levels to have labels you need to create a “factor” version of these variables.

Learn more about “labelled” data with the haven package and then with gtsummary and related “tidyverse” packages:

Check the status of these variables - these are helpful functions:

class()
str()
glimpse()
table()
is.factor()

# look at properties of the h1 dataset
str(h1)

## tibble [228 × 5] (S3: tbl_df/tbl/data.frame)
##  $ homeless: dbl+lbl [1:228] 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, ...
##    ..@ label      : chr "One or more nights on the street or shelter in past 6 months"
##    ..@ format.spss: chr "F1.0"
##    ..@ labels     : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "no" "yes"
##  $ female  : dbl+lbl [1:228] 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
##    ..@ label      : chr "Gender of respondent"
##    ..@ format.spss: chr "F1.0"
##    ..@ labels     : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
##  $ pcs     : num [1:228] 74.8 46.5 39.2 55.2 44.8 ...
##   ..- attr(*, "label")= chr "SF36 Physical Composite Score - Baseline"
##   ..- attr(*, "format.spss")= chr "F16.13"
##   ..- attr(*, "display_width")= int 18
##  $ pss_fr  : num [1:228] 13 5 13 1 7 9 13 14 6 3 ...
##   ..- attr(*, "label")= chr "Perceived Social Support - friends"
##   ..- attr(*, "format.spss")= chr "F2.0"
##  $ indtot  : num [1:228] 41 29 40 38 26 17 37 27 40 34 ...
##   ..- attr(*, "label")= chr "Inventory of Drug Use Consequences (InDue) total score - Baseline"
##   ..- attr(*, "format.spss")= chr "F2.0"

glimpse(h1)

## Rows: 228
## Columns: 5
## $ homeless <dbl+lbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, …
## $ female   <dbl+lbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ pcs      <dbl> 74.80633, 46.47521, 39.24264, 55.16998, 44.77651, 56.43837, 3…
## $ pss_fr   <dbl> 13, 5, 13, 1, 7, 9, 13, 14, 6, 3, 3, 8, 6, 11, 10, 3, 7, 10, …
## $ indtot   <dbl> 41, 29, 40, 38, 26, 17, 37, 27, 40, 34, 44, 33, 43, 41, 32, 4…

# look at specifics of homeless and female
class(h1$homeless)

## [1] "haven_labelled" "vctrs_vctr"     "double"

table(h1$homeless, useNA = "ifany")

## 
##   0   1 
## 131  97

is.factor(h1$homeless)

## [1] FALSE

class(h1$female)

## [1] "haven_labelled" "vctrs_vctr"     "double"

table(h1$female, useNA = "ifany")

## 
##   0   1 
## 173  55

is.factor(h1$female)

## [1] FALSE

Add “factor” type variables

I often try to keep BOTH the numeric version of a variable along with a “factor” type version for a couple of reasons:

flexibility - some functions work fine with either the numeric or factor type version but some will only work with one or the other
prettier output - having the factor version works well in tables and plots since the labels for the levels are displayed
factor-type variables ALWAYS start at 1 even if the original numeric coding started at 0 or even had negative numbers!!

And given the original use of haven we can use the haven function as_factor() to create a factor type variable and preserve the labeling from SPSS.

Let’s explore the attributes() of each version and see what happens if we try to force the variables as.numeric().

# look at attributes of original variable
attributes(h1$homeless)

## $label
## [1] "One or more nights on the street or shelter in past 6 months"
## 
## $format.spss
## [1] "F1.0"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##  no yes 
##   0   1

# and look at data values
h1$homeless

## <labelled<double>[228]>: One or more nights on the street or shelter in past 6 months
##   [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
##  [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
##  [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1
## 
## Labels:
##  value label
##      0    no
##      1   yes

as.numeric(h1$homeless)

##   [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
##  [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
##  [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1

# now create a factor type version
h1$homeless.f <- haven::as_factor(h1$homeless)

attributes(h1$homeless.f)

## $levels
## [1] "no"  "yes"
## 
## $class
## [1] "factor"
## 
## $label
## [1] "One or more nights on the street or shelter in past 6 months"

h1$homeless.f

##   [1] no  no  yes no  yes no  no  no  no  no  yes yes no  no  no  yes yes yes
##  [19] no  no  no  yes no  no  yes yes no  yes yes yes no  yes no  yes no  yes
##  [37] no  no  yes yes yes no  no  no  yes no  yes no  no  no  yes yes yes no 
##  [55] yes yes no  no  yes yes no  no  yes yes yes yes yes no  no  yes yes yes
##  [73] no  no  no  no  no  yes no  yes no  yes no  no  yes yes yes no  yes no 
##  [91] no  no  no  no  yes no  no  no  no  no  no  yes no  no  no  no  no  yes
## [109] yes no  yes no  yes no  no  yes yes no  yes no  no  yes no  no  yes no 
## [127] yes yes no  no  no  no  no  yes no  no  yes no  no  yes yes no  yes yes
## [145] yes yes no  yes yes no  yes yes no  yes yes no  no  yes no  no  no  no 
## [163] no  yes no  yes no  yes no  no  yes yes no  yes no  yes no  no  yes yes
## [181] no  yes yes yes no  no  no  no  no  yes no  no  no  no  no  yes no  no 
## [199] no  no  yes no  no  no  no  no  no  no  yes yes no  yes no  no  no  yes
## [217] no  yes no  no  yes yes yes yes yes no  no  yes
## attr(,"label")
## [1] One or more nights on the street or shelter in past 6 months
## Levels: no yes

as.numeric(h1$homeless.f)

##   [1] 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 2 2 2 1 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1
##  [38] 1 2 2 2 1 1 1 2 1 2 1 1 1 2 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 1
##  [75] 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2
## [112] 1 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 2
## [149] 2 1 2 2 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1
## [186] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 1 2 1 1 2 2
## [223] 2 2 2 1 1 2

# do the same for female
h1$female.f <- haven::as_factor(h1$female)

Look at each predictor with the outcome

Crosstables for categorical variables

Let’s look at the categorical predictors with the categorical outcome using a CrossTable() function - check for small cell counts. We can even run a chi-square test to get started evaluating the predictors (univariate).

For the continuous-numeric predictors, let’s look at comparing the predictors levels between those who are homeless versus those who are not (aka, via T-tests for a non-parametric equivalent).

Using gmodels package:

library(gmodels)
CrossTable(x=h1$female.f, y=h1$homeless.f,
           expected = FALSE,
           prop.r = FALSE,
           prop.c = TRUE,
           prop.t = FALSE,
           prop.chisq = FALSE,
           chisq = TRUE,
           format = "SPSS")

## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |          Column Percent |
## |-------------------------|
## 
## Total Observations in Table:  228 
## 
##              | h1$homeless.f 
##  h1$female.f |       no  |      yes  | Row Total | 
## -------------|-----------|-----------|-----------|
##         Male |       94  |       79  |      173  | 
##              |   71.756% |   81.443% |           | 
## -------------|-----------|-----------|-----------|
##       Female |       37  |       18  |       55  | 
##              |   28.244% |   18.557% |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      131  |       97  |      228  | 
##              |   57.456% |   42.544% |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  2.857585     d.f. =  1     p =  0.09094395 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  2.352824     d.f. =  1     p =  0.1250563 
## 
##  
##        Minimum expected frequency: 23.39912

Using gtsummary::tbl_cross()

NOTE: this works fine but generates a warning - ignore for now… see https://github.com/ddsjoberg/gtsummary/issues/1455.

library(gtsummary)
h1 %>%
  tbl_cross(
    row = female.f,
    col = homeless.f,
    percent = "column"
  ) %>%
  add_p()

	One or more nights on the street or shelter in past 6 months		Total	p-value¹
	no	yes	Total	p-value¹
Gender of respondent				0.091
Male	94 (72%)	79 (81%)	173 (76%)
Female	37 (28%)	18 (19%)	55 (24%)
Total	131 (100%)	97 (100%)	228 (100%)
¹ Pearson's Chi-squared test

check normality of the continuous predictors

As you can see in the histograms and normal probability plots below:

pcs is reasonably normal, perhaps slightly wider than a typical normal distribution, but ok;
pss_fr has a uniform distribution which has “positive” kurtosis (wider and flatter than a normal), but it is symmetric - let’s not transform;
indtot has an obvious left skewness. We could transform this by:
- subtracting the value from the maximum of 45
- and then doing either a SQRT or LOG transform
- but let’s leave it untransformed for now.

library(car)

hist(h1$pcs)

car::qqPlot(h1$pcs)

## [1] 181   1

hist(h1$pss_fr)

car::qqPlot(h1$pss_fr)

## [1] 101 102

hist(h1$indtot)

car::qqPlot(h1$indtot)

## [1] 188  19

Comparison Table - using the `arsenal` package

Put together a table comparing these predictors between homeless (1=“yes”) or not homeless (0=“no”).

I’ve included example code below to show how to customize your table and statistical tests:

for female run a chi-square test, display the number of missing values (if any), along with the count and percent of each category (column percents shown);
for pss_fr and pcs run an ANOVA (same as t-test for 2 groups), and display the number of missing values (if any), along with the mean and standard deviation; and
for indtot run a Kruskal-Wallis non-parametric test (same as a Mann-Whitney test for 2 groups), display the number of missing values (if any), along with the median and interquartile range, aka 1st and 3rd quartiles (Q1, Q3).

library(arsenal)

tab1 <- tableby(
  homeless.f ~ 
    chisq(female.f, "Nmiss", "countpct") +
    anova(pss_fr, "Nmiss", "meansd") + 
    anova(pcs, "Nmiss", "meansd") +
    kwt(indtot, "Nmiss", "median","q1q3"),
  data = h1
)

# get table, add title and footnote for p-value tests
summary(tab1,
        title = "Characteristics of Homeless No vs Yes",
        pfootnote = TRUE)

Characteristics of Homeless No vs Yes
	no (N=131)	yes (N=97)	Total (N=228)	p value
Gender of respondent				0.091¹
Male	94 (71.8%)	79 (81.4%)	173 (75.9%)
Female	37 (28.2%)	18 (18.6%)	55 (24.1%)
Perceived Social Support - friends				0.031²
Mean (SD)	7.580 (4.063)	6.423 (3.862)	7.088 (4.011)
SF36 Physical Composite Score - Baseline				0.051²
Mean (SD)	49.893 (10.595)	47.100 (10.676)	48.705 (10.696)
Inventory of Drug Use Consequences (InDue) total score - Baseline				< 0.001³
Median	34.000	39.000	37.000
Q1, Q3	30.000, 39.500	33.000, 42.000	31.000, 41.000

Pearson’s Chi-squared test
Linear Model ANOVA
Kruskal-Wallis rank sum test

`GGally`- a fun package for making scatterplot matrix figures

Since the outcome only has 2 categories, female only has 2 categories and the rest are continuous numeric variables, we can put them together into a 2-way scatterplot matrix which is useful for visualizing correlations and “cross-table” associations.

First we will go ahead and create female and homeless as a factor (using the as.factor() function).

# just select the key variables
h2 <- h1 %>%
  select(homeless.f, female.f, pss_fr, pcs, indtot)

library(GGally)
ggpairs(h2)

Run the plot again and use homeless.f as a factor to add color and see the grouping effects.

ggpairs(h2, aes(color = homeless.f))

Univariate Logistic Regression

Let’s run a logistic regression of pss_fr to predict the probability of being homeless. We’ll also SAVE the predicted probabilities and the predicted group membership.

We’ll also look at different threshold cutoffs and then look at the tradeoffs between false positives (FP) and false negatives (FN), via the classification table (also called the confusion matrix or contingency table).

We will also use get the ROC (receiver operating characteristic) curve and compute the AUC (area under the curve) to get an idea of the predictive accuracy of the model (similar to R2 for linear regression).

glm1 <- glm(homeless.f ~ pss_fr, 
            data=h2,
            family=binomial)

glm1

## 
## Call:  glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
## 
## Coefficients:
## (Intercept)       pss_fr  
##     0.21218     -0.07323  
## 
## Degrees of Freedom: 227 Total (i.e. Null);  226 Residual
## Null Deviance:       311 
## Residual Deviance: 306.3     AIC: 310.3

summary(glm1)

## 
## Call:
## glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2687  -1.0528  -0.9104   1.2435   1.5363  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.21218    0.27210   0.780    0.436  
## pss_fr      -0.07323    0.03414  -2.145    0.032 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 310.99  on 227  degrees of freedom
## Residual deviance: 306.29  on 226  degrees of freedom
## AIC: 310.29
## 
## Number of Fisher Scoring iterations: 4

See the raw coefficients

coef(glm1)

## (Intercept)      pss_fr 
##  0.21217529 -0.07323422

Compute odds ratios = exp(coefficients)

or1 <- exp(coef(glm1))
or1

## (Intercept)      pss_fr 
##   1.2363646   0.9293831

Get 95% confidence intervals for odds ratios

or1.ci <- exp(confint(glm1))
or1.ci

##                 2.5 %    97.5 %
## (Intercept) 0.7254963 2.1150135
## pss_fr      0.8683720 0.9930925

Put these pieces of output together

data.frame(or1, or1.ci)

##                   or1    X2.5..   X97.5..
## (Intercept) 1.2363646 0.7254963 2.1150135
## pss_fr      0.9293831 0.8683720 0.9930925

Add better column names

or1.df <- data.frame(or1, or1.ci)
names(or1.df) <- c("odds ratio", "95% CI LB", "95% CI UB")
or1.df

##             odds ratio 95% CI LB 95% CI UB
## (Intercept)  1.2363646 0.7254963 2.1150135
## pss_fr       0.9293831 0.8683720 0.9930925

Use knitr::kable() to make a better table

knitr::kable(or1.df,
             caption = "Odds Ratios from LogReg Model")

Odds Ratios from LogReg Model
	odds ratio	95% CI LB	95% CI UB
(Intercept)	1.2363646	0.7254963	2.1150135
pss_fr	0.9293831	0.8683720	0.9930925

Using `gtsummary::tbl_regression()`

See example for how to set this up at https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html.

library(gtsummary)
# format results into data frame with global p-values
glm1 %>%
  tbl_regression(
    exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>%
  add_global_p() %>%
  bold_p(t = 0.10) %>%
  bold_labels() %>%
  italicize_levels()

Characteristic	OR¹	95% CI¹	p-value
Perceived Social Support - friends	0.93	0.87, 0.99	0.030
¹ OR = Odds Ratio, CI = Confidence Interval

Model Predictions - Original data (Training data)

Get predicted probability of being homeless.

glm1.predict <- predict(glm1, newdata=h2,
                        type="response")

Plot probability of being homeless by the continuous pss_fr predictor.

plot(h2$pss_fr, glm1.predict)
abline(0.5, 0, col = "red")

can also make plot using ggplot2

h2.predict1 <- cbind(h2, glm1.predict)
library(ggplot2)
ggplot(h2.predict1, aes(pss_fr, glm1.predict)) +
  geom_point() +
  geom_smooth(color = "blue") +
  geom_hline(yintercept = 0.5, color = "red")

Look at how well or poorly the model does predicting homelessness with different cutoff probabilities. Each cutoff will produce different FN and FP tradeoffs. The cutoff probability will range from >0 to <1.

The ROWS show the ORIGINAL category (0=not homeless, 1=homeless). The COLUMNS show whether the model predicted homeless correct (TRUE) or not (FALSE).

For the 1st cutoff of 0.5, there are:

111 true negatives TN (model=FALSE, for not homeless)
80 false negatives FN (model=FALSE, for homeless)
20 false positives FP (model=TRUE, for not homeless)
17 true positives TP (model=TRUE, for homeless)

and then moving the cutpoint down to 0.4, decreases the FN but increases the FP.

#confusion matrix using cutpoint of 0.5
table(h2$homeless.f, glm1.predict > 0.5)

##      
##       FALSE TRUE
##   no    111   20
##   yes    80   17

#confusion matrix using cutpoint of 0.4
table(h2$homeless.f, glm1.predict > 0.4)

##      
##       FALSE TRUE
##   no     60   71
##   yes    29   68

OPTIONAL - try the `caret` package

The caret package has a lot of helpful functions, but let’s look at the confusionMatrix() function to get detailed output statistics associated with a contingency table. Learn more at:

Also check out this cool website https://statpages.info/ctab2x2.html for all the stats you can compute from 2-x-2 tables. See if you can replicate these stats with their online calculator.

library(caret)

Save table output where you list predicted first, then original values.

The goal is to create a table with this layout:

Original	Event	non event
Predicted
TRUE	A = TP	B = FP
FALSE	C = FN	D = TN

seehttps://rdrr.io/cran/caret/man/confusionMatrix.html

This should generate:

Sensitivity = A / (A+C) = TP / (TP + FN)
Specificity = D / (B+D) = TN / (FP + TN)

tab1 <- table(glm1.predict < 0.5, 
              h2$homeless.f == "no")
caret::confusionMatrix(tab1)

## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE    17   20
##   TRUE     80  111
##                                           
##                Accuracy : 0.5614          
##                  95% CI : (0.4944, 0.6268)
##     No Information Rate : 0.5746          
##     P-Value [Acc > NIR] : 0.6813          
##                                           
##                   Kappa : 0.0246          
##                                           
##  Mcnemar's Test P-Value : 3.635e-09       
##                                           
##             Sensitivity : 0.17526         
##             Specificity : 0.84733         
##          Pos Pred Value : 0.45946         
##          Neg Pred Value : 0.58115         
##              Prevalence : 0.42544         
##          Detection Rate : 0.07456         
##    Detection Prevalence : 0.16228         
##       Balanced Accuracy : 0.51129         
##                                           
##        'Positive' Class : FALSE           
##

Given the cutoff of 0.5m the calculations are as follows:

111 true negatives TN (model=FALSE, for not homeless)
80 false negatives FN (model=FALSE, for homeless)
20 false positives FP (model=TRUE, for not homeless)
17 true positives TP (model=TRUE, for homeless)

So:

Sensitivity = A / (A+C) = TP / (TP + FN) = 17/97 = 0.175
Specificity = D / (B+D) = TN / (FP + TN) = 111/131 = 0.847

ROC Curve and the AUC

Rather than doing trial and error to find the sweet spot, if we run through a range of cutoffs from over 0.0 to just under 1.0, we can look at the tradeoffs and plot this receiver operating curve. The ROC is a plot of the:

Specificity = True Positive Rate = TP / (TP + FN) = TP / (original=“yes”)
1-Sensitivity = False Positive Rate = FP / (FP + TN) = FP / (original=“no”)

Ideally, we’d like to see the curve as close to the upper left quadrant of the 0-1, 0-1 square plot space. There is a 45 degree diagonal line drawn through the middle which represents an AUC of 0.5 indicating the model does no better than flipping a coin (50/50) to get the outcome category correct.

Let’s compute the AUC and plot an ROC.

library(ROCR)
p <- predict(glm1, newdata=h2, 
             type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col = "red")

Compute AUC, area under the curve also called the C-statistic.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.5825923

ideally you want AUC as close to 1 as possible AUC > 0.9 is great AUC > 0.8 is good AUC > 0.7 is ok AUC < 0.7 are not very good AUC around 0.5 is no better than flipping a fair coin which is a useless model also - add title to plot with AUC in title

plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc, 3)))
abline(0, 1, col="red")

Multivariate Model - with 4 predictors entered together

glm3 <- glm(homeless.f ~ female.f + pss_fr + pcs + indtot, 
            data=h2, family=binomial)

glm3

## 
## Call:  glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot, 
##     family = binomial, data = h2)
## 
## Coefficients:
##    (Intercept)  female.fFemale          pss_fr             pcs          indtot  
##       -1.01825        -0.32523        -0.04347        -0.02271         0.06171  
## 
## Degrees of Freedom: 227 Total (i.e. Null);  223 Residual
## Null Deviance:       311 
## Residual Deviance: 291.7     AIC: 301.7

summary(glm3)

## 
## Call:
## glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot, 
##     family = binomial, data = h2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6081  -1.0271  -0.7233   1.1258   2.2149  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.01825    1.23297  -0.826  0.40889   
## female.fFemale -0.32523    0.36293  -0.896  0.37019   
## pss_fr         -0.04347    0.03633  -1.196  0.23160   
## pcs            -0.02271    0.01350  -1.683  0.09241 . 
## indtot          0.06171    0.02346   2.631  0.00851 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 310.99  on 227  degrees of freedom
## Residual deviance: 291.70  on 223  degrees of freedom
## AIC: 301.7
## 
## Number of Fisher Scoring iterations: 4

coef(glm3)

##    (Intercept) female.fFemale         pss_fr            pcs         indtot 
##    -1.01824694    -0.32522757    -0.04346571    -0.02271106     0.06171333

exp(coef(glm3))

##    (Intercept) female.fFemale         pss_fr            pcs         indtot 
##      0.3612276      0.7223629      0.9574654      0.9775449      1.0636574

exp(confint(glm3))

##                     2.5 %   97.5 %
## (Intercept)    0.03060536 3.936927
## female.fFemale 0.35050025 1.464009
## pss_fr         0.89114133 1.027995
## pcs            0.95166482 1.003577
## indtot         1.01744640 1.115883

Put pieces together and look at predicted values using a cutoff of 0.5.

data.frame(exp(coef(glm3)),
           exp(confint(glm3)))

##                exp.coef.glm3..     X2.5..  X97.5..
## (Intercept)          0.3612276 0.03060536 3.936927
## female.fFemale       0.7223629 0.35050025 1.464009
## pss_fr               0.9574654 0.89114133 1.027995
## pcs                  0.9775449 0.95166482 1.003577
## indtot               1.0636574 1.01744640 1.115883

glm3.predict <- predict(glm3, newdata=h2,
                        type="response")

gmodels::CrossTable(h2$homeless.f, glm3.predict > 0.5,
                    prop.r = FALSE,
                    prop.c = FALSE,
                    prop.t = TRUE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  228 
## 
##  
##               | glm3.predict > 0.5 
## h2$homeless.f |     FALSE |      TRUE | Row Total | 
## --------------|-----------|-----------|-----------|
##            no |       103 |        28 |       131 | 
##               |     2.810 |     5.621 |           | 
##               |     0.452 |     0.123 |           | 
## --------------|-----------|-----------|-----------|
##           yes |        49 |        48 |        97 | 
##               |     3.796 |     7.591 |           | 
##               |     0.215 |     0.211 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       152 |        76 |       228 | 
## --------------|-----------|-----------|-----------|
## 
##

FYI: See https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/

Make an ROC curve for model glm3

library(ROCR)
p <- predict(glm3, newdata=h2, 
             type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")

auc3 <- performance(pr, measure = "auc")
auc3 <- auc3@y.values[[1]]
auc3

## [1] 0.6827733

add title to plot with AUC in title

plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc3, 3)))
abline(0, 1, col="red")

Compare 2 models:

glm1 with pss_fr
glm3 with female + pss_fr + pcs + indtot

Note: lower AIC and BIC is better and higher AUC is better

AIC(glm1, glm3)

##      df      AIC
## glm1  2 310.2932
## glm3  5 301.7022

BIC(glm1, glm3)

##      df      BIC
## glm1  2 317.1519
## glm3  5 318.8489

c(auc, auc3)

## [1] 0.5825923 0.6827733

For nested models we can compare glm3 and glm1: note - test options are “Rao”, “LRT”, “Chisq”, “F”, “Cp”, see help(“anova.glm”)

anova(glm1, glm3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       226     306.29                        
## 2       223     291.70  3   14.591 0.002202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(glm1, glm3, test = "LRT")

## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       226     306.29                        
## 2       223     291.70  3   14.591 0.002202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(glm1, glm3, test = "Rao")

## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance    Rao Pr(>Chi)   
## 1       226     306.29                               
## 2       223     291.70  3   14.591 13.549 0.003587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

at1 <- anova(glm1, glm3, test = "Chisq")

pull out the p-values

at1$`Pr(>Chi)`[2]

## [1] 0.00220175

Merge gtsummary tables

See example at https://www.danieldsjoberg.com/gtsummary/articles/gallery.html#side-by-side.

glm1_tab <- glm1 %>%
  tbl_regression(exponentiate = TRUE)
glm3_tab <- glm3 %>%
  tbl_regression(exponentiate = TRUE)

theme_gtsummary_compact()
tbl_merge(list(glm1_tab, glm3_tab),
          tab_spanner = c("Unadjusted", 
                          "Adjusted"))

Characteristic	Unadjusted			Adjusted
Characteristic	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
Perceived Social Support - friends	0.93	0.87, 0.99	0.032	0.96	0.89, 1.03	0.2
Gender of respondent
Male				—	—
Female				0.72	0.35, 1.46	0.4
SF36 Physical Composite Score - Baseline				0.98	0.95, 1.00	0.092
Inventory of Drug Use Consequences (InDue) total score - Baseline				1.06	1.02, 1.12	0.009
¹ OR = Odds Ratio, CI = Confidence Interval

Other fun options - plot of ORs

See sjplot package, https://strengejacke.github.io/sjPlot/articles/plot_model_estimates.html

library(sjPlot)
sjPlot::plot_model(glm3, vline.color = "red")

Notice that scale is a log scale on X-axis, see https://andrewpwheeler.com/2013/10/26/odds-ratios-need-to-be-graphed-on-log-scales/

Other articles to explore - packages for presenting model summaries:

Check model against NEW DATA - “test” dataset

Load HELP dataset 2 (the other random half of the original HELP dataset).

load("help_set2_wide.RData")
h1test <- helpdat.set2 %>%
  dplyr::select(homeless, female, pcs, pss_fr, indtot)

h1test <- h1test %>%
  mutate(female.f = haven::as_factor(female),
         homeless.f = haven::as_factor(homeless))

h2test <- h1test %>%
  select(homeless.f, female.f, pss_fr, pcs, indtot)

library(ROCR)
p <- predict(glm3, newdata=h2test, 
             type="response")
pr <- prediction(p, as.numeric(h2test$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")

auc3test <- performance(pr, measure = "auc")
auc3test <- auc3test@y.values[[1]]
auc3test

## [1] 0.6269753

plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc3test, 3)))
abline(0, 1, col="red")

# print AUC for model using train and test data
paste0("AUC with training data ", round(auc3, digits = 3))

## [1] "AUC with training data 0.683"

paste0("AUC with test data ", round(auc3test, digits = 3))

## [1] "AUC with test data 0.627"

Lesson - logistic regression

Melinda Higgins and Vicki Hertzberg

03/14/2023

H.E.L.P. Dataset

HELP Variables of Focus For This Exercise

Check variable properties

Add “factor” type variables

Look at each predictor with the outcome

Crosstables for categorical variables

check normality of the continuous predictors

Comparison Table - using the `arsenal` package

`GGally`- a fun package for making scatterplot matrix figures

Univariate Logistic Regression

Using `gtsummary::tbl_regression()`

Model Predictions - Original data (Training data)

OPTIONAL - try the `caret` package

ROC Curve and the AUC

Multivariate Model - with 4 predictors entered together

Merge gtsummary tables

Other fun options - plot of ORs

Check model against NEW DATA - “test” dataset

Lesson - logistic regression

Melinda Higgins and Vicki Hertzberg

03/14/2023

H.E.L.P. Dataset

HELP Variables of Focus For This Exercise

Check variable properties

Add “factor” type variables

Look at each predictor with the outcome

Crosstables for categorical variables

check normality of the continuous predictors

Comparison Table - using the arsenal package

GGally- a fun package for making scatterplot matrix figures

Univariate Logistic Regression

Using gtsummary::tbl_regression()

Model Predictions - Original data (Training data)

OPTIONAL - try the caret package

ROC Curve and the AUC

Multivariate Model - with 4 predictors entered together

Merge gtsummary tables

Other fun options - plot of ORs

Check model against NEW DATA - “test” dataset

Comparison Table - using the `arsenal` package

`GGally`- a fun package for making scatterplot matrix figures

Using `gtsummary::tbl_regression()`

OPTIONAL - try the `caret` package