H.E.L.P. Dataset

For this exercise, we’ll be working with the HELP (Health Evaluation and Linkage to Primary Care) dataset. You can learn more about this dataset at https://nhorton.people.amherst.edu/sasr2/datasets.php. This dataset is also used by Ken Kleinman and Nicholas J. Horton for their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (which is another helpful textbook). Also see https://melindahiggins2000.github.io/N736/helpdata.html.

The dataset is ready for you, just load "help_set1_wide.RData".

library(tidyverse)
load("help_set1_wide.RData")
helpdata <- helpdat.set1

HELP Variables of Focus For This Exercise

Let’s focus on homeless as the main outcome variable which is dichotomous coded as 0 and 1. We’ll use logistic regression to look at predicting whether someone was homeless or not using these variables:

  • female: female=1, male=0;
  • pss_fr: Perceived Social Support from Friends
  • pcs: Physical Component Score (SF-36 quality of life)
  • indtot: Inventory of Drug Use

Select these variables and save in a smaller dataset, h1.

h1 <- helpdata %>%
  dplyr::select(homeless, female, pcs, pss_fr, indtot)

Check variable properties

NOTE: These data were originally read in from an SPSS formatted file using the haven package. The advantage of this is that the labels of the variables did come over but the “factor levels” are not included. For example, if we look at a table of h1$homeless we will only see 0 and 1 and not the character levels of “yes” and “no”. If you want these levels to have labels you need to create a “factor” version of these variables.

Learn more about “labelled” data with the haven package and then with gtsummary and related “tidyverse” packages:

Check the status of these variables - these are helpful functions:

  • class()
  • str()
  • glimpse()
  • table()
  • is.factor()
# look at properties of the h1 dataset
str(h1)
## tibble [228 × 5] (S3: tbl_df/tbl/data.frame)
##  $ homeless: dbl+lbl [1:228] 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, ...
##    ..@ label      : chr "One or more nights on the street or shelter in past 6 months"
##    ..@ format.spss: chr "F1.0"
##    ..@ labels     : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "no" "yes"
##  $ female  : dbl+lbl [1:228] 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
##    ..@ label      : chr "Gender of respondent"
##    ..@ format.spss: chr "F1.0"
##    ..@ labels     : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
##  $ pcs     : num [1:228] 74.8 46.5 39.2 55.2 44.8 ...
##   ..- attr(*, "label")= chr "SF36 Physical Composite Score - Baseline"
##   ..- attr(*, "format.spss")= chr "F16.13"
##   ..- attr(*, "display_width")= int 18
##  $ pss_fr  : num [1:228] 13 5 13 1 7 9 13 14 6 3 ...
##   ..- attr(*, "label")= chr "Perceived Social Support - friends"
##   ..- attr(*, "format.spss")= chr "F2.0"
##  $ indtot  : num [1:228] 41 29 40 38 26 17 37 27 40 34 ...
##   ..- attr(*, "label")= chr "Inventory of Drug Use Consequences (InDue) total score - Baseline"
##   ..- attr(*, "format.spss")= chr "F2.0"
glimpse(h1)
## Rows: 228
## Columns: 5
## $ homeless <dbl+lbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, …
## $ female   <dbl+lbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ pcs      <dbl> 74.80633, 46.47521, 39.24264, 55.16998, 44.77651, 56.43837, 3…
## $ pss_fr   <dbl> 13, 5, 13, 1, 7, 9, 13, 14, 6, 3, 3, 8, 6, 11, 10, 3, 7, 10, …
## $ indtot   <dbl> 41, 29, 40, 38, 26, 17, 37, 27, 40, 34, 44, 33, 43, 41, 32, 4…
# look at specifics of homeless and female
class(h1$homeless)
## [1] "haven_labelled" "vctrs_vctr"     "double"
table(h1$homeless, useNA = "ifany")
## 
##   0   1 
## 131  97
is.factor(h1$homeless)
## [1] FALSE
class(h1$female)
## [1] "haven_labelled" "vctrs_vctr"     "double"
table(h1$female, useNA = "ifany")
## 
##   0   1 
## 173  55
is.factor(h1$female)
## [1] FALSE

Add “factor” type variables

I often try to keep BOTH the numeric version of a variable along with a “factor” type version for a couple of reasons:

  1. flexibility - some functions work fine with either the numeric or factor type version but some will only work with one or the other
  2. prettier output - having the factor version works well in tables and plots since the labels for the levels are displayed
  3. factor-type variables ALWAYS start at 1 even if the original numeric coding started at 0 or even had negative numbers!!

And given the original use of haven we can use the haven function as_factor() to create a factor type variable and preserve the labeling from SPSS.

Let’s explore the attributes() of each version and see what happens if we try to force the variables as.numeric().

# look at attributes of original variable
attributes(h1$homeless)
## $label
## [1] "One or more nights on the street or shelter in past 6 months"
## 
## $format.spss
## [1] "F1.0"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##  no yes 
##   0   1
# and look at data values
h1$homeless
## <labelled<double>[228]>: One or more nights on the street or shelter in past 6 months
##   [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
##  [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
##  [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1
## 
## Labels:
##  value label
##      0    no
##      1   yes
as.numeric(h1$homeless)
##   [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
##  [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
##  [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1
# now create a factor type version
h1$homeless.f <- haven::as_factor(h1$homeless)

attributes(h1$homeless.f)
## $levels
## [1] "no"  "yes"
## 
## $class
## [1] "factor"
## 
## $label
## [1] "One or more nights on the street or shelter in past 6 months"
h1$homeless.f
##   [1] no  no  yes no  yes no  no  no  no  no  yes yes no  no  no  yes yes yes
##  [19] no  no  no  yes no  no  yes yes no  yes yes yes no  yes no  yes no  yes
##  [37] no  no  yes yes yes no  no  no  yes no  yes no  no  no  yes yes yes no 
##  [55] yes yes no  no  yes yes no  no  yes yes yes yes yes no  no  yes yes yes
##  [73] no  no  no  no  no  yes no  yes no  yes no  no  yes yes yes no  yes no 
##  [91] no  no  no  no  yes no  no  no  no  no  no  yes no  no  no  no  no  yes
## [109] yes no  yes no  yes no  no  yes yes no  yes no  no  yes no  no  yes no 
## [127] yes yes no  no  no  no  no  yes no  no  yes no  no  yes yes no  yes yes
## [145] yes yes no  yes yes no  yes yes no  yes yes no  no  yes no  no  no  no 
## [163] no  yes no  yes no  yes no  no  yes yes no  yes no  yes no  no  yes yes
## [181] no  yes yes yes no  no  no  no  no  yes no  no  no  no  no  yes no  no 
## [199] no  no  yes no  no  no  no  no  no  no  yes yes no  yes no  no  no  yes
## [217] no  yes no  no  yes yes yes yes yes no  no  yes
## attr(,"label")
## [1] One or more nights on the street or shelter in past 6 months
## Levels: no yes
as.numeric(h1$homeless.f)
##   [1] 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 2 2 2 1 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1
##  [38] 1 2 2 2 1 1 1 2 1 2 1 1 1 2 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 1
##  [75] 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2
## [112] 1 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 2
## [149] 2 1 2 2 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1
## [186] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 1 2 1 1 2 2
## [223] 2 2 2 1 1 2
# do the same for female
h1$female.f <- haven::as_factor(h1$female)

Look at each predictor with the outcome

Crosstables for categorical variables

Let’s look at the categorical predictors with the categorical outcome using a CrossTable() function - check for small cell counts. We can even run a chi-square test to get started evaluating the predictors (univariate).

For the continuous-numeric predictors, let’s look at comparing the predictors levels between those who are homeless versus those who are not (aka, via T-tests for a non-parametric equivalent).

Using gmodels package:

library(gmodels)
CrossTable(x=h1$female.f, y=h1$homeless.f,
           expected = FALSE,
           prop.r = FALSE,
           prop.c = TRUE,
           prop.t = FALSE,
           prop.chisq = FALSE,
           chisq = TRUE,
           format = "SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |          Column Percent |
## |-------------------------|
## 
## Total Observations in Table:  228 
## 
##              | h1$homeless.f 
##  h1$female.f |       no  |      yes  | Row Total | 
## -------------|-----------|-----------|-----------|
##         Male |       94  |       79  |      173  | 
##              |   71.756% |   81.443% |           | 
## -------------|-----------|-----------|-----------|
##       Female |       37  |       18  |       55  | 
##              |   28.244% |   18.557% |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      131  |       97  |      228  | 
##              |   57.456% |   42.544% |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  2.857585     d.f. =  1     p =  0.09094395 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  2.352824     d.f. =  1     p =  0.1250563 
## 
##  
##        Minimum expected frequency: 23.39912

Using gtsummary::tbl_cross()

NOTE: this works fine but generates a warning - ignore for now… see https://github.com/ddsjoberg/gtsummary/issues/1455.

library(gtsummary)
h1 %>%
  tbl_cross(
    row = female.f,
    col = homeless.f,
    percent = "column"
  ) %>%
  add_p()
One or more nights on the street or shelter in past 6 months Total p-value1
no yes
Gender of respondent 0.091
Male 94 (72%) 79 (81%) 173 (76%)
Female 37 (28%) 18 (19%) 55 (24%)
Total 131 (100%) 97 (100%) 228 (100%)
1 Pearson's Chi-squared test

check normality of the continuous predictors

As you can see in the histograms and normal probability plots below:

  • pcs is reasonably normal, perhaps slightly wider than a typical normal distribution, but ok;
  • pss_fr has a uniform distribution which has “positive” kurtosis (wider and flatter than a normal), but it is symmetric - let’s not transform;
  • indtot has an obvious left skewness. We could transform this by:
    • subtracting the value from the maximum of 45
    • and then doing either a SQRT or LOG transform
    • but let’s leave it untransformed for now.
library(car)

hist(h1$pcs)

car::qqPlot(h1$pcs)

## [1] 181   1
hist(h1$pss_fr)

car::qqPlot(h1$pss_fr)

## [1] 101 102
hist(h1$indtot)

car::qqPlot(h1$indtot)

## [1] 188  19

Comparison Table - using the arsenal package

Put together a table comparing these predictors between homeless (1=“yes”) or not homeless (0=“no”).

I’ve included example code below to show how to customize your table and statistical tests:

  • for female run a chi-square test, display the number of missing values (if any), along with the count and percent of each category (column percents shown);
  • for pss_fr and pcs run an ANOVA (same as t-test for 2 groups), and display the number of missing values (if any), along with the mean and standard deviation; and
  • for indtot run a Kruskal-Wallis non-parametric test (same as a Mann-Whitney test for 2 groups), display the number of missing values (if any), along with the median and interquartile range, aka 1st and 3rd quartiles (Q1, Q3).
library(arsenal)

tab1 <- tableby(
  homeless.f ~ 
    chisq(female.f, "Nmiss", "countpct") +
    anova(pss_fr, "Nmiss", "meansd") + 
    anova(pcs, "Nmiss", "meansd") +
    kwt(indtot, "Nmiss", "median","q1q3"),
  data = h1
)

# get table, add title and footnote for p-value tests
summary(tab1,
        title = "Characteristics of Homeless No vs Yes",
        pfootnote = TRUE)
Characteristics of Homeless No vs Yes
no (N=131) yes (N=97) Total (N=228) p value
Gender of respondent 0.0911
   Male 94 (71.8%) 79 (81.4%) 173 (75.9%)
   Female 37 (28.2%) 18 (18.6%) 55 (24.1%)
Perceived Social Support - friends 0.0312
   Mean (SD) 7.580 (4.063) 6.423 (3.862) 7.088 (4.011)
SF36 Physical Composite Score - Baseline 0.0512
   Mean (SD) 49.893 (10.595) 47.100 (10.676) 48.705 (10.696)
Inventory of Drug Use Consequences (InDue) total score - Baseline < 0.0013
   Median 34.000 39.000 37.000
   Q1, Q3 30.000, 39.500 33.000, 42.000 31.000, 41.000
  1. Pearson’s Chi-squared test
  2. Linear Model ANOVA
  3. Kruskal-Wallis rank sum test

GGally- a fun package for making scatterplot matrix figures

Since the outcome only has 2 categories, female only has 2 categories and the rest are continuous numeric variables, we can put them together into a 2-way scatterplot matrix which is useful for visualizing correlations and “cross-table” associations.

First we will go ahead and create female and homeless as a factor (using the as.factor() function).

# just select the key variables
h2 <- h1 %>%
  select(homeless.f, female.f, pss_fr, pcs, indtot)

library(GGally)
ggpairs(h2)

Run the plot again and use homeless.f as a factor to add color and see the grouping effects.

ggpairs(h2, aes(color = homeless.f))

Univariate Logistic Regression

Let’s run a logistic regression of pss_fr to predict the probability of being homeless. We’ll also SAVE the predicted probabilities and the predicted group membership.

We’ll also look at different threshold cutoffs and then look at the tradeoffs between false positives (FP) and false negatives (FN), via the classification table (also called the confusion matrix or contingency table).

We will also use get the ROC (receiver operating characteristic) curve and compute the AUC (area under the curve) to get an idea of the predictive accuracy of the model (similar to R2 for linear regression).

glm1 <- glm(homeless.f ~ pss_fr, 
            data=h2,
            family=binomial)

glm1
## 
## Call:  glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
## 
## Coefficients:
## (Intercept)       pss_fr  
##     0.21218     -0.07323  
## 
## Degrees of Freedom: 227 Total (i.e. Null);  226 Residual
## Null Deviance:       311 
## Residual Deviance: 306.3     AIC: 310.3
summary(glm1)
## 
## Call:
## glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2687  -1.0528  -0.9104   1.2435   1.5363  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.21218    0.27210   0.780    0.436  
## pss_fr      -0.07323    0.03414  -2.145    0.032 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 310.99  on 227  degrees of freedom
## Residual deviance: 306.29  on 226  degrees of freedom
## AIC: 310.29
## 
## Number of Fisher Scoring iterations: 4

See the raw coefficients

coef(glm1)
## (Intercept)      pss_fr 
##  0.21217529 -0.07323422

Compute odds ratios = exp(coefficients)

or1 <- exp(coef(glm1))
or1
## (Intercept)      pss_fr 
##   1.2363646   0.9293831

Get 95% confidence intervals for odds ratios

or1.ci <- exp(confint(glm1))
or1.ci
##                 2.5 %    97.5 %
## (Intercept) 0.7254963 2.1150135
## pss_fr      0.8683720 0.9930925

Put these pieces of output together

data.frame(or1, or1.ci)
##                   or1    X2.5..   X97.5..
## (Intercept) 1.2363646 0.7254963 2.1150135
## pss_fr      0.9293831 0.8683720 0.9930925

Add better column names

or1.df <- data.frame(or1, or1.ci)
names(or1.df) <- c("odds ratio", "95% CI LB", "95% CI UB")
or1.df
##             odds ratio 95% CI LB 95% CI UB
## (Intercept)  1.2363646 0.7254963 2.1150135
## pss_fr       0.9293831 0.8683720 0.9930925

Use knitr::kable() to make a better table

knitr::kable(or1.df,
             caption = "Odds Ratios from LogReg Model")
Odds Ratios from LogReg Model
odds ratio 95% CI LB 95% CI UB
(Intercept) 1.2363646 0.7254963 2.1150135
pss_fr 0.9293831 0.8683720 0.9930925

Using gtsummary::tbl_regression()

See example for how to set this up at https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html.

library(gtsummary)
# format results into data frame with global p-values
glm1 %>%
  tbl_regression(
    exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>%
  add_global_p() %>%
  bold_p(t = 0.10) %>%
  bold_labels() %>%
  italicize_levels()
Characteristic OR1 95% CI1 p-value
Perceived Social Support - friends 0.93 0.87, 0.99 0.030
1 OR = Odds Ratio, CI = Confidence Interval

Model Predictions - Original data (Training data)

Get predicted probability of being homeless.

glm1.predict <- predict(glm1, newdata=h2,
                        type="response")

Plot probability of being homeless by the continuous pss_fr predictor.

plot(h2$pss_fr, glm1.predict)
abline(0.5, 0, col = "red")

can also make plot using ggplot2

h2.predict1 <- cbind(h2, glm1.predict)
library(ggplot2)
ggplot(h2.predict1, aes(pss_fr, glm1.predict)) +
  geom_point() +
  geom_smooth(color = "blue") +
  geom_hline(yintercept = 0.5, color = "red") 

Look at how well or poorly the model does predicting homelessness with different cutoff probabilities. Each cutoff will produce different FN and FP tradeoffs. The cutoff probability will range from >0 to <1.

The ROWS show the ORIGINAL category (0=not homeless, 1=homeless). The COLUMNS show whether the model predicted homeless correct (TRUE) or not (FALSE).

For the 1st cutoff of 0.5, there are:

  • 111 true negatives TN (model=FALSE, for not homeless)
  • 80 false negatives FN (model=FALSE, for homeless)
  • 20 false positives FP (model=TRUE, for not homeless)
  • 17 true positives TP (model=TRUE, for homeless)

and then moving the cutpoint down to 0.4, decreases the FN but increases the FP.

#confusion matrix using cutpoint of 0.5
table(h2$homeless.f, glm1.predict > 0.5)
##      
##       FALSE TRUE
##   no    111   20
##   yes    80   17
#confusion matrix using cutpoint of 0.4
table(h2$homeless.f, glm1.predict > 0.4)
##      
##       FALSE TRUE
##   no     60   71
##   yes    29   68

OPTIONAL - try the caret package

The caret package has a lot of helpful functions, but let’s look at the confusionMatrix() function to get detailed output statistics associated with a contingency table. Learn more at:

Also check out this cool website https://statpages.info/ctab2x2.html for all the stats you can compute from 2-x-2 tables. See if you can replicate these stats with their online calculator.

library(caret)

Save table output where you list predicted first, then original values.

The goal is to create a table with this layout:

Original Event non event
Predicted
TRUE A = TP B = FP
FALSE C = FN D = TN

seehttps://rdrr.io/cran/caret/man/confusionMatrix.html

This should generate:

  • Sensitivity = A / (A+C) = TP / (TP + FN)
  • Specificity = D / (B+D) = TN / (FP + TN)
tab1 <- table(glm1.predict < 0.5, 
              h2$homeless.f == "no")
caret::confusionMatrix(tab1)
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE    17   20
##   TRUE     80  111
##                                           
##                Accuracy : 0.5614          
##                  95% CI : (0.4944, 0.6268)
##     No Information Rate : 0.5746          
##     P-Value [Acc > NIR] : 0.6813          
##                                           
##                   Kappa : 0.0246          
##                                           
##  Mcnemar's Test P-Value : 3.635e-09       
##                                           
##             Sensitivity : 0.17526         
##             Specificity : 0.84733         
##          Pos Pred Value : 0.45946         
##          Neg Pred Value : 0.58115         
##              Prevalence : 0.42544         
##          Detection Rate : 0.07456         
##    Detection Prevalence : 0.16228         
##       Balanced Accuracy : 0.51129         
##                                           
##        'Positive' Class : FALSE           
## 

Given the cutoff of 0.5m the calculations are as follows:

  • 111 true negatives TN (model=FALSE, for not homeless)
  • 80 false negatives FN (model=FALSE, for homeless)
  • 20 false positives FP (model=TRUE, for not homeless)
  • 17 true positives TP (model=TRUE, for homeless)

So:

  • Sensitivity = A / (A+C) = TP / (TP + FN) = 17/97 = 0.175
  • Specificity = D / (B+D) = TN / (FP + TN) = 111/131 = 0.847

ROC Curve and the AUC

Rather than doing trial and error to find the sweet spot, if we run through a range of cutoffs from over 0.0 to just under 1.0, we can look at the tradeoffs and plot this receiver operating curve. The ROC is a plot of the:

  • Specificity = True Positive Rate = TP / (TP + FN) = TP / (original=“yes”)
  • 1-Sensitivity = False Positive Rate = FP / (FP + TN) = FP / (original=“no”)

Ideally, we’d like to see the curve as close to the upper left quadrant of the 0-1, 0-1 square plot space. There is a 45 degree diagonal line drawn through the middle which represents an AUC of 0.5 indicating the model does no better than flipping a coin (50/50) to get the outcome category correct.

Let’s compute the AUC and plot an ROC.

library(ROCR)
p <- predict(glm1, newdata=h2, 
             type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col = "red")

Compute AUC, area under the curve also called the C-statistic.

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.5825923

ideally you want AUC as close to 1 as possible AUC > 0.9 is great AUC > 0.8 is good AUC > 0.7 is ok AUC < 0.7 are not very good AUC around 0.5 is no better than flipping a fair coin which is a useless model also - add title to plot with AUC in title

plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc, 3)))
abline(0, 1, col="red")

Multivariate Model - with 4 predictors entered together

glm3 <- glm(homeless.f ~ female.f + pss_fr + pcs + indtot, 
            data=h2, family=binomial)

glm3
## 
## Call:  glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot, 
##     family = binomial, data = h2)
## 
## Coefficients:
##    (Intercept)  female.fFemale          pss_fr             pcs          indtot  
##       -1.01825        -0.32523        -0.04347        -0.02271         0.06171  
## 
## Degrees of Freedom: 227 Total (i.e. Null);  223 Residual
## Null Deviance:       311 
## Residual Deviance: 291.7     AIC: 301.7
summary(glm3)
## 
## Call:
## glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot, 
##     family = binomial, data = h2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6081  -1.0271  -0.7233   1.1258   2.2149  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.01825    1.23297  -0.826  0.40889   
## female.fFemale -0.32523    0.36293  -0.896  0.37019   
## pss_fr         -0.04347    0.03633  -1.196  0.23160   
## pcs            -0.02271    0.01350  -1.683  0.09241 . 
## indtot          0.06171    0.02346   2.631  0.00851 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 310.99  on 227  degrees of freedom
## Residual deviance: 291.70  on 223  degrees of freedom
## AIC: 301.7
## 
## Number of Fisher Scoring iterations: 4
coef(glm3)
##    (Intercept) female.fFemale         pss_fr            pcs         indtot 
##    -1.01824694    -0.32522757    -0.04346571    -0.02271106     0.06171333
exp(coef(glm3))
##    (Intercept) female.fFemale         pss_fr            pcs         indtot 
##      0.3612276      0.7223629      0.9574654      0.9775449      1.0636574
exp(confint(glm3))
##                     2.5 %   97.5 %
## (Intercept)    0.03060536 3.936927
## female.fFemale 0.35050025 1.464009
## pss_fr         0.89114133 1.027995
## pcs            0.95166482 1.003577
## indtot         1.01744640 1.115883

Put pieces together and look at predicted values using a cutoff of 0.5.

data.frame(exp(coef(glm3)),
           exp(confint(glm3)))
##                exp.coef.glm3..     X2.5..  X97.5..
## (Intercept)          0.3612276 0.03060536 3.936927
## female.fFemale       0.7223629 0.35050025 1.464009
## pss_fr               0.9574654 0.89114133 1.027995
## pcs                  0.9775449 0.95166482 1.003577
## indtot               1.0636574 1.01744640 1.115883
glm3.predict <- predict(glm3, newdata=h2,
                        type="response")

gmodels::CrossTable(h2$homeless.f, glm3.predict > 0.5,
                    prop.r = FALSE,
                    prop.c = FALSE,
                    prop.t = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  228 
## 
##  
##               | glm3.predict > 0.5 
## h2$homeless.f |     FALSE |      TRUE | Row Total | 
## --------------|-----------|-----------|-----------|
##            no |       103 |        28 |       131 | 
##               |     2.810 |     5.621 |           | 
##               |     0.452 |     0.123 |           | 
## --------------|-----------|-----------|-----------|
##           yes |        49 |        48 |        97 | 
##               |     3.796 |     7.591 |           | 
##               |     0.215 |     0.211 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       152 |        76 |       228 | 
## --------------|-----------|-----------|-----------|
## 
## 

FYI: See https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/

Make an ROC curve for model glm3

library(ROCR)
p <- predict(glm3, newdata=h2, 
             type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")

auc3 <- performance(pr, measure = "auc")
auc3 <- auc3@y.values[[1]]
auc3
## [1] 0.6827733

add title to plot with AUC in title

plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc3, 3)))
abline(0, 1, col="red")

Compare 2 models:

  • glm1 with pss_fr
  • glm3 with female + pss_fr + pcs + indtot

Note: lower AIC and BIC is better and higher AUC is better

AIC(glm1, glm3)
##      df      AIC
## glm1  2 310.2932
## glm3  5 301.7022
BIC(glm1, glm3)
##      df      BIC
## glm1  2 317.1519
## glm3  5 318.8489
c(auc, auc3)
## [1] 0.5825923 0.6827733

For nested models we can compare glm3 and glm1: note - test options are “Rao”, “LRT”, “Chisq”, “F”, “Cp”, see help(“anova.glm”)

anova(glm1, glm3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       226     306.29                        
## 2       223     291.70  3   14.591 0.002202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(glm1, glm3, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       226     306.29                        
## 2       223     291.70  3   14.591 0.002202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(glm1, glm3, test = "Rao")
## Analysis of Deviance Table
## 
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
##   Resid. Df Resid. Dev Df Deviance    Rao Pr(>Chi)   
## 1       226     306.29                               
## 2       223     291.70  3   14.591 13.549 0.003587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
at1 <- anova(glm1, glm3, test = "Chisq")

pull out the p-values

at1$`Pr(>Chi)`[2]
## [1] 0.00220175

Merge gtsummary tables

See example at https://www.danieldsjoberg.com/gtsummary/articles/gallery.html#side-by-side.

glm1_tab <- glm1 %>%
  tbl_regression(exponentiate = TRUE)
glm3_tab <- glm3 %>%
  tbl_regression(exponentiate = TRUE)

theme_gtsummary_compact()
tbl_merge(list(glm1_tab, glm3_tab),
          tab_spanner = c("Unadjusted", 
                          "Adjusted"))
Characteristic Unadjusted Adjusted
OR1 95% CI1 p-value OR1 95% CI1 p-value
Perceived Social Support - friends 0.93 0.87, 0.99 0.032 0.96 0.89, 1.03 0.2
Gender of respondent
Male
Female 0.72 0.35, 1.46 0.4
SF36 Physical Composite Score - Baseline 0.98 0.95, 1.00 0.092
Inventory of Drug Use Consequences (InDue) total score - Baseline 1.06 1.02, 1.12 0.009
1 OR = Odds Ratio, CI = Confidence Interval

Check model against NEW DATA - “test” dataset

Load HELP dataset 2 (the other random half of the original HELP dataset).

load("help_set2_wide.RData")
h1test <- helpdat.set2 %>%
  dplyr::select(homeless, female, pcs, pss_fr, indtot)

h1test <- h1test %>%
  mutate(female.f = haven::as_factor(female),
         homeless.f = haven::as_factor(homeless))

h2test <- h1test %>%
  select(homeless.f, female.f, pss_fr, pcs, indtot)

library(ROCR)
p <- predict(glm3, newdata=h2test, 
             type="response")
pr <- prediction(p, as.numeric(h2test$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")

auc3test <- performance(pr, measure = "auc")
auc3test <- auc3test@y.values[[1]]
auc3test
## [1] 0.6269753
plot(prf,
     main = paste("ROC Curve, AUC = ", round(auc3test, 3)))
abline(0, 1, col="red")

# print AUC for model using train and test data
paste0("AUC with training data ", round(auc3, digits = 3))
## [1] "AUC with training data 0.683"
paste0("AUC with test data ", round(auc3test, digits = 3))
## [1] "AUC with test data 0.627"