For this exercise, we’ll be working with the HELP (Health Evaluation and Linkage to Primary Care) dataset. You can learn more about this dataset at https://nhorton.people.amherst.edu/sasr2/datasets.php. This dataset is also used by Ken Kleinman and Nicholas J. Horton for their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (which is another helpful textbook). Also see https://melindahiggins2000.github.io/N736/helpdata.html.
The dataset is ready for you, just load
"help_set1_wide.RData"
.
library(tidyverse)
load("help_set1_wide.RData")
helpdata <- helpdat.set1
Let’s focus on homeless
as the main outcome variable
which is dichotomous coded as 0 and 1. We’ll use logistic regression to
look at predicting whether someone was homeless or not using these
variables:
Select these variables and save in a smaller dataset,
h1
.
h1 <- helpdata %>%
dplyr::select(homeless, female, pcs, pss_fr, indtot)
NOTE: These data were originally read in from an
SPSS formatted file using the haven
package. The advantage
of this is that the labels of the variables did come over but the
“factor levels” are not included. For example, if we look at a table of
h1$homeless
we will only see 0 and 1 and not the character
levels of “yes” and “no”. If you want these levels to have labels you
need to create a “factor” version of these variables.
Learn more about “labelled” data with the haven
package
and then with gtsummary
and related “tidyverse”
packages:
Check the status of these variables - these are helpful functions:
class()
str()
glimpse()
table()
is.factor()
# look at properties of the h1 dataset
str(h1)
## tibble [228 × 5] (S3: tbl_df/tbl/data.frame)
## $ homeless: dbl+lbl [1:228] 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, ...
## ..@ label : chr "One or more nights on the street or shelter in past 6 months"
## ..@ format.spss: chr "F1.0"
## ..@ labels : Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "no" "yes"
## $ female : dbl+lbl [1:228] 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## ..@ label : chr "Gender of respondent"
## ..@ format.spss: chr "F1.0"
## ..@ labels : Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
## $ pcs : num [1:228] 74.8 46.5 39.2 55.2 44.8 ...
## ..- attr(*, "label")= chr "SF36 Physical Composite Score - Baseline"
## ..- attr(*, "format.spss")= chr "F16.13"
## ..- attr(*, "display_width")= int 18
## $ pss_fr : num [1:228] 13 5 13 1 7 9 13 14 6 3 ...
## ..- attr(*, "label")= chr "Perceived Social Support - friends"
## ..- attr(*, "format.spss")= chr "F2.0"
## $ indtot : num [1:228] 41 29 40 38 26 17 37 27 40 34 ...
## ..- attr(*, "label")= chr "Inventory of Drug Use Consequences (InDue) total score - Baseline"
## ..- attr(*, "format.spss")= chr "F2.0"
glimpse(h1)
## Rows: 228
## Columns: 5
## $ homeless <dbl+lbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, …
## $ female <dbl+lbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ pcs <dbl> 74.80633, 46.47521, 39.24264, 55.16998, 44.77651, 56.43837, 3…
## $ pss_fr <dbl> 13, 5, 13, 1, 7, 9, 13, 14, 6, 3, 3, 8, 6, 11, 10, 3, 7, 10, …
## $ indtot <dbl> 41, 29, 40, 38, 26, 17, 37, 27, 40, 34, 44, 33, 43, 41, 32, 4…
# look at specifics of homeless and female
class(h1$homeless)
## [1] "haven_labelled" "vctrs_vctr" "double"
table(h1$homeless, useNA = "ifany")
##
## 0 1
## 131 97
is.factor(h1$homeless)
## [1] FALSE
class(h1$female)
## [1] "haven_labelled" "vctrs_vctr" "double"
table(h1$female, useNA = "ifany")
##
## 0 1
## 173 55
is.factor(h1$female)
## [1] FALSE
I often try to keep BOTH the numeric version of a variable along with a “factor” type version for a couple of reasons:
And given the original use of haven
we can use the haven
function as_factor() to create a factor type variable and preserve the
labeling from SPSS.
Let’s explore the attributes()
of each version and see
what happens if we try to force the variables
as.numeric()
.
# look at attributes of original variable
attributes(h1$homeless)
## $label
## [1] "One or more nights on the street or shelter in past 6 months"
##
## $format.spss
## [1] "F1.0"
##
## $class
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $labels
## no yes
## 0 1
# and look at data values
h1$homeless
## <labelled<double>[228]>: One or more nights on the street or shelter in past 6 months
## [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
## [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
## [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1
##
## Labels:
## value label
## 0 no
## 1 yes
as.numeric(h1$homeless)
## [1] 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0
## [38] 0 1 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0
## [75] 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1
## [112] 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 1
## [149] 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0
## [186] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1
## [223] 1 1 1 0 0 1
# now create a factor type version
h1$homeless.f <- haven::as_factor(h1$homeless)
attributes(h1$homeless.f)
## $levels
## [1] "no" "yes"
##
## $class
## [1] "factor"
##
## $label
## [1] "One or more nights on the street or shelter in past 6 months"
h1$homeless.f
## [1] no no yes no yes no no no no no yes yes no no no yes yes yes
## [19] no no no yes no no yes yes no yes yes yes no yes no yes no yes
## [37] no no yes yes yes no no no yes no yes no no no yes yes yes no
## [55] yes yes no no yes yes no no yes yes yes yes yes no no yes yes yes
## [73] no no no no no yes no yes no yes no no yes yes yes no yes no
## [91] no no no no yes no no no no no no yes no no no no no yes
## [109] yes no yes no yes no no yes yes no yes no no yes no no yes no
## [127] yes yes no no no no no yes no no yes no no yes yes no yes yes
## [145] yes yes no yes yes no yes yes no yes yes no no yes no no no no
## [163] no yes no yes no yes no no yes yes no yes no yes no no yes yes
## [181] no yes yes yes no no no no no yes no no no no no yes no no
## [199] no no yes no no no no no no no yes yes no yes no no no yes
## [217] no yes no no yes yes yes yes yes no no yes
## attr(,"label")
## [1] One or more nights on the street or shelter in past 6 months
## Levels: no yes
as.numeric(h1$homeless.f)
## [1] 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 2 2 2 1 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1
## [38] 1 2 2 2 1 1 1 2 1 2 1 1 1 2 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 1
## [75] 1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2
## [112] 1 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 2
## [149] 2 1 2 2 1 2 2 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1
## [186] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 2 1 1 1 2 1 2 1 1 2 2
## [223] 2 2 2 1 1 2
# do the same for female
h1$female.f <- haven::as_factor(h1$female)
Let’s look at the categorical predictors with the categorical outcome using a CrossTable() function - check for small cell counts. We can even run a chi-square test to get started evaluating the predictors (univariate).
For the continuous-numeric predictors, let’s look at comparing the predictors levels between those who are homeless versus those who are not (aka, via T-tests for a non-parametric equivalent).
Using gmodels
package:
library(gmodels)
CrossTable(x=h1$female.f, y=h1$homeless.f,
expected = FALSE,
prop.r = FALSE,
prop.c = TRUE,
prop.t = FALSE,
prop.chisq = FALSE,
chisq = TRUE,
format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Column Percent |
## |-------------------------|
##
## Total Observations in Table: 228
##
## | h1$homeless.f
## h1$female.f | no | yes | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 94 | 79 | 173 |
## | 71.756% | 81.443% | |
## -------------|-----------|-----------|-----------|
## Female | 37 | 18 | 55 |
## | 28.244% | 18.557% | |
## -------------|-----------|-----------|-----------|
## Column Total | 131 | 97 | 228 |
## | 57.456% | 42.544% | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 2.857585 d.f. = 1 p = 0.09094395
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 2.352824 d.f. = 1 p = 0.1250563
##
##
## Minimum expected frequency: 23.39912
Using gtsummary::tbl_cross()
NOTE: this works fine but generates a warning - ignore for now… see https://github.com/ddsjoberg/gtsummary/issues/1455.
library(gtsummary)
h1 %>%
tbl_cross(
row = female.f,
col = homeless.f,
percent = "column"
) %>%
add_p()
One or more nights on the street or shelter in past 6 months | Total | p-value1 | ||
---|---|---|---|---|
no | yes | |||
Gender of respondent | 0.091 | |||
Male | 94 (72%) | 79 (81%) | 173 (76%) | |
Female | 37 (28%) | 18 (19%) | 55 (24%) | |
Total | 131 (100%) | 97 (100%) | 228 (100%) | |
1 Pearson's Chi-squared test |
As you can see in the histograms and normal probability plots below:
pcs
is reasonably normal, perhaps slightly wider than a
typical normal distribution, but ok;pss_fr
has a uniform distribution which has “positive”
kurtosis (wider and flatter than a normal), but it is symmetric - let’s
not transform;indtot
has an obvious left skewness. We could transform
this by:
library(car)
hist(h1$pcs)
car::qqPlot(h1$pcs)
## [1] 181 1
hist(h1$pss_fr)
car::qqPlot(h1$pss_fr)
## [1] 101 102
hist(h1$indtot)
car::qqPlot(h1$indtot)
## [1] 188 19
arsenal
packagePut together a table comparing these predictors between homeless (1=“yes”) or not homeless (0=“no”).
I’ve included example code below to show how to customize your table and statistical tests:
female
run a chi-square test, display the number of
missing values (if any), along with the count and percent of each
category (column percents shown);pss_fr
and pcs
run an ANOVA (same as
t-test for 2 groups), and display the number of missing values (if any),
along with the mean and standard deviation; andindtot
run a Kruskal-Wallis non-parametric test
(same as a Mann-Whitney test for 2 groups), display the number of
missing values (if any), along with the median and interquartile range,
aka 1st and 3rd quartiles (Q1, Q3).library(arsenal)
tab1 <- tableby(
homeless.f ~
chisq(female.f, "Nmiss", "countpct") +
anova(pss_fr, "Nmiss", "meansd") +
anova(pcs, "Nmiss", "meansd") +
kwt(indtot, "Nmiss", "median","q1q3"),
data = h1
)
# get table, add title and footnote for p-value tests
summary(tab1,
title = "Characteristics of Homeless No vs Yes",
pfootnote = TRUE)
no (N=131) | yes (N=97) | Total (N=228) | p value | |
---|---|---|---|---|
Gender of respondent | 0.0911 | |||
Male | 94 (71.8%) | 79 (81.4%) | 173 (75.9%) | |
Female | 37 (28.2%) | 18 (18.6%) | 55 (24.1%) | |
Perceived Social Support - friends | 0.0312 | |||
Mean (SD) | 7.580 (4.063) | 6.423 (3.862) | 7.088 (4.011) | |
SF36 Physical Composite Score - Baseline | 0.0512 | |||
Mean (SD) | 49.893 (10.595) | 47.100 (10.676) | 48.705 (10.696) | |
Inventory of Drug Use Consequences (InDue) total score - Baseline | < 0.0013 | |||
Median | 34.000 | 39.000 | 37.000 | |
Q1, Q3 | 30.000, 39.500 | 33.000, 42.000 | 31.000, 41.000 |
GGally
- a fun package for making scatterplot matrix
figuresSince the outcome only has 2 categories, female only has 2 categories and the rest are continuous numeric variables, we can put them together into a 2-way scatterplot matrix which is useful for visualizing correlations and “cross-table” associations.
First we will go ahead and create female
and
homeless
as a factor (using the as.factor()
function).
# just select the key variables
h2 <- h1 %>%
select(homeless.f, female.f, pss_fr, pcs, indtot)
library(GGally)
ggpairs(h2)
Run the plot again and use homeless.f
as a factor to add
color and see the grouping effects.
ggpairs(h2, aes(color = homeless.f))
Let’s run a logistic regression of pss_fr
to predict the
probability of being homeless
. We’ll also SAVE the
predicted probabilities and the predicted group membership.
We’ll also look at different threshold cutoffs and then look at the tradeoffs between false positives (FP) and false negatives (FN), via the classification table (also called the confusion matrix or contingency table).
We will also use get the ROC (receiver operating characteristic) curve and compute the AUC (area under the curve) to get an idea of the predictive accuracy of the model (similar to R2 for linear regression).
glm1 <- glm(homeless.f ~ pss_fr,
data=h2,
family=binomial)
glm1
##
## Call: glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
##
## Coefficients:
## (Intercept) pss_fr
## 0.21218 -0.07323
##
## Degrees of Freedom: 227 Total (i.e. Null); 226 Residual
## Null Deviance: 311
## Residual Deviance: 306.3 AIC: 310.3
summary(glm1)
##
## Call:
## glm(formula = homeless.f ~ pss_fr, family = binomial, data = h2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2687 -1.0528 -0.9104 1.2435 1.5363
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21218 0.27210 0.780 0.436
## pss_fr -0.07323 0.03414 -2.145 0.032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 310.99 on 227 degrees of freedom
## Residual deviance: 306.29 on 226 degrees of freedom
## AIC: 310.29
##
## Number of Fisher Scoring iterations: 4
See the raw coefficients
coef(glm1)
## (Intercept) pss_fr
## 0.21217529 -0.07323422
Compute odds ratios = exp(coefficients)
or1 <- exp(coef(glm1))
or1
## (Intercept) pss_fr
## 1.2363646 0.9293831
Get 95% confidence intervals for odds ratios
or1.ci <- exp(confint(glm1))
or1.ci
## 2.5 % 97.5 %
## (Intercept) 0.7254963 2.1150135
## pss_fr 0.8683720 0.9930925
Put these pieces of output together
data.frame(or1, or1.ci)
## or1 X2.5.. X97.5..
## (Intercept) 1.2363646 0.7254963 2.1150135
## pss_fr 0.9293831 0.8683720 0.9930925
Add better column names
or1.df <- data.frame(or1, or1.ci)
names(or1.df) <- c("odds ratio", "95% CI LB", "95% CI UB")
or1.df
## odds ratio 95% CI LB 95% CI UB
## (Intercept) 1.2363646 0.7254963 2.1150135
## pss_fr 0.9293831 0.8683720 0.9930925
Use knitr::kable()
to make a better table
knitr::kable(or1.df,
caption = "Odds Ratios from LogReg Model")
odds ratio | 95% CI LB | 95% CI UB | |
---|---|---|---|
(Intercept) | 1.2363646 | 0.7254963 | 2.1150135 |
pss_fr | 0.9293831 | 0.8683720 | 0.9930925 |
gtsummary::tbl_regression()
See example for how to set this up at https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html.
library(gtsummary)
# format results into data frame with global p-values
glm1 %>%
tbl_regression(
exponentiate = TRUE,
pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>%
add_global_p() %>%
bold_p(t = 0.10) %>%
bold_labels() %>%
italicize_levels()
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Perceived Social Support - friends | 0.93 | 0.87, 0.99 | 0.030 |
1 OR = Odds Ratio, CI = Confidence Interval |
Get predicted probability of being homeless.
glm1.predict <- predict(glm1, newdata=h2,
type="response")
Plot probability of being homeless by the continuous pss_fr predictor.
plot(h2$pss_fr, glm1.predict)
abline(0.5, 0, col = "red")
can also make plot using ggplot2
h2.predict1 <- cbind(h2, glm1.predict)
library(ggplot2)
ggplot(h2.predict1, aes(pss_fr, glm1.predict)) +
geom_point() +
geom_smooth(color = "blue") +
geom_hline(yintercept = 0.5, color = "red")
Look at how well or poorly the model does predicting homelessness with different cutoff probabilities. Each cutoff will produce different FN and FP tradeoffs. The cutoff probability will range from >0 to <1.
The ROWS show the ORIGINAL category (0=not homeless, 1=homeless). The COLUMNS show whether the model predicted homeless correct (TRUE) or not (FALSE).
For the 1st cutoff of 0.5, there are:
and then moving the cutpoint down to 0.4, decreases the FN but increases the FP.
#confusion matrix using cutpoint of 0.5
table(h2$homeless.f, glm1.predict > 0.5)
##
## FALSE TRUE
## no 111 20
## yes 80 17
#confusion matrix using cutpoint of 0.4
table(h2$homeless.f, glm1.predict > 0.4)
##
## FALSE TRUE
## no 60 71
## yes 29 68
caret
packageThe caret
package has a lot of helpful functions, but
let’s look at the confusionMatrix()
function to get
detailed output statistics associated with a contingency table. Learn
more at:
Also check out this cool website https://statpages.info/ctab2x2.html for all the stats you can compute from 2-x-2 tables. See if you can replicate these stats with their online calculator.
library(caret)
Save table output where you list predicted first, then original values.
The goal is to create a table with this layout:
Original | Event | non event |
---|---|---|
Predicted | ||
TRUE | A = TP | B = FP |
FALSE | C = FN | D = TN |
seehttps://rdrr.io/cran/caret/man/confusionMatrix.html
This should generate:
tab1 <- table(glm1.predict < 0.5,
h2$homeless.f == "no")
caret::confusionMatrix(tab1)
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 17 20
## TRUE 80 111
##
## Accuracy : 0.5614
## 95% CI : (0.4944, 0.6268)
## No Information Rate : 0.5746
## P-Value [Acc > NIR] : 0.6813
##
## Kappa : 0.0246
##
## Mcnemar's Test P-Value : 3.635e-09
##
## Sensitivity : 0.17526
## Specificity : 0.84733
## Pos Pred Value : 0.45946
## Neg Pred Value : 0.58115
## Prevalence : 0.42544
## Detection Rate : 0.07456
## Detection Prevalence : 0.16228
## Balanced Accuracy : 0.51129
##
## 'Positive' Class : FALSE
##
Given the cutoff of 0.5m the calculations are as follows:
So:
Rather than doing trial and error to find the sweet spot, if we run through a range of cutoffs from over 0.0 to just under 1.0, we can look at the tradeoffs and plot this receiver operating curve. The ROC is a plot of the:
Ideally, we’d like to see the curve as close to the upper left quadrant of the 0-1, 0-1 square plot space. There is a 45 degree diagonal line drawn through the middle which represents an AUC of 0.5 indicating the model does no better than flipping a coin (50/50) to get the outcome category correct.
Let’s compute the AUC and plot an ROC.
library(ROCR)
p <- predict(glm1, newdata=h2,
type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col = "red")
Compute AUC, area under the curve also called the C-statistic.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.5825923
ideally you want AUC as close to 1 as possible AUC > 0.9 is great AUC > 0.8 is good AUC > 0.7 is ok AUC < 0.7 are not very good AUC around 0.5 is no better than flipping a fair coin which is a useless model also - add title to plot with AUC in title
plot(prf,
main = paste("ROC Curve, AUC = ", round(auc, 3)))
abline(0, 1, col="red")
glm3 <- glm(homeless.f ~ female.f + pss_fr + pcs + indtot,
data=h2, family=binomial)
glm3
##
## Call: glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot,
## family = binomial, data = h2)
##
## Coefficients:
## (Intercept) female.fFemale pss_fr pcs indtot
## -1.01825 -0.32523 -0.04347 -0.02271 0.06171
##
## Degrees of Freedom: 227 Total (i.e. Null); 223 Residual
## Null Deviance: 311
## Residual Deviance: 291.7 AIC: 301.7
summary(glm3)
##
## Call:
## glm(formula = homeless.f ~ female.f + pss_fr + pcs + indtot,
## family = binomial, data = h2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6081 -1.0271 -0.7233 1.1258 2.2149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.01825 1.23297 -0.826 0.40889
## female.fFemale -0.32523 0.36293 -0.896 0.37019
## pss_fr -0.04347 0.03633 -1.196 0.23160
## pcs -0.02271 0.01350 -1.683 0.09241 .
## indtot 0.06171 0.02346 2.631 0.00851 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 310.99 on 227 degrees of freedom
## Residual deviance: 291.70 on 223 degrees of freedom
## AIC: 301.7
##
## Number of Fisher Scoring iterations: 4
coef(glm3)
## (Intercept) female.fFemale pss_fr pcs indtot
## -1.01824694 -0.32522757 -0.04346571 -0.02271106 0.06171333
exp(coef(glm3))
## (Intercept) female.fFemale pss_fr pcs indtot
## 0.3612276 0.7223629 0.9574654 0.9775449 1.0636574
exp(confint(glm3))
## 2.5 % 97.5 %
## (Intercept) 0.03060536 3.936927
## female.fFemale 0.35050025 1.464009
## pss_fr 0.89114133 1.027995
## pcs 0.95166482 1.003577
## indtot 1.01744640 1.115883
Put pieces together and look at predicted values using a cutoff of 0.5.
data.frame(exp(coef(glm3)),
exp(confint(glm3)))
## exp.coef.glm3.. X2.5.. X97.5..
## (Intercept) 0.3612276 0.03060536 3.936927
## female.fFemale 0.7223629 0.35050025 1.464009
## pss_fr 0.9574654 0.89114133 1.027995
## pcs 0.9775449 0.95166482 1.003577
## indtot 1.0636574 1.01744640 1.115883
glm3.predict <- predict(glm3, newdata=h2,
type="response")
gmodels::CrossTable(h2$homeless.f, glm3.predict > 0.5,
prop.r = FALSE,
prop.c = FALSE,
prop.t = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 228
##
##
## | glm3.predict > 0.5
## h2$homeless.f | FALSE | TRUE | Row Total |
## --------------|-----------|-----------|-----------|
## no | 103 | 28 | 131 |
## | 2.810 | 5.621 | |
## | 0.452 | 0.123 | |
## --------------|-----------|-----------|-----------|
## yes | 49 | 48 | 97 |
## | 3.796 | 7.591 | |
## | 0.215 | 0.211 | |
## --------------|-----------|-----------|-----------|
## Column Total | 152 | 76 | 228 |
## --------------|-----------|-----------|-----------|
##
##
FYI: See https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
Make an ROC curve for model glm3
library(ROCR)
p <- predict(glm3, newdata=h2,
type="response")
pr <- prediction(p, as.numeric(h2$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")
auc3 <- performance(pr, measure = "auc")
auc3 <- auc3@y.values[[1]]
auc3
## [1] 0.6827733
add title to plot with AUC in title
plot(prf,
main = paste("ROC Curve, AUC = ", round(auc3, 3)))
abline(0, 1, col="red")
Compare 2 models:
glm1
with pss_fr
glm3
with female
+ pss_fr
+
pcs
+ indtot
Note: lower AIC and BIC is better and higher AUC is better
AIC(glm1, glm3)
## df AIC
## glm1 2 310.2932
## glm3 5 301.7022
BIC(glm1, glm3)
## df BIC
## glm1 2 317.1519
## glm3 5 318.8489
c(auc, auc3)
## [1] 0.5825923 0.6827733
For nested models we can compare glm3
and
glm1
: note - test options are “Rao”, “LRT”, “Chisq”, “F”,
“Cp”, see help(“anova.glm”)
anova(glm1, glm3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 226 306.29
## 2 223 291.70 3 14.591 0.002202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(glm1, glm3, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 226 306.29
## 2 223 291.70 3 14.591 0.002202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(glm1, glm3, test = "Rao")
## Analysis of Deviance Table
##
## Model 1: homeless.f ~ pss_fr
## Model 2: homeless.f ~ female.f + pss_fr + pcs + indtot
## Resid. Df Resid. Dev Df Deviance Rao Pr(>Chi)
## 1 226 306.29
## 2 223 291.70 3 14.591 13.549 0.003587 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
at1 <- anova(glm1, glm3, test = "Chisq")
pull out the p-values
at1$`Pr(>Chi)`[2]
## [1] 0.00220175
See example at https://www.danieldsjoberg.com/gtsummary/articles/gallery.html#side-by-side.
glm1_tab <- glm1 %>%
tbl_regression(exponentiate = TRUE)
glm3_tab <- glm3 %>%
tbl_regression(exponentiate = TRUE)
theme_gtsummary_compact()
tbl_merge(list(glm1_tab, glm3_tab),
tab_spanner = c("Unadjusted",
"Adjusted"))
Characteristic | Unadjusted | Adjusted | ||||
---|---|---|---|---|---|---|
OR1 | 95% CI1 | p-value | OR1 | 95% CI1 | p-value | |
Perceived Social Support - friends | 0.93 | 0.87, 0.99 | 0.032 | 0.96 | 0.89, 1.03 | 0.2 |
Gender of respondent | ||||||
Male | — | — | ||||
Female | 0.72 | 0.35, 1.46 | 0.4 | |||
SF36 Physical Composite Score - Baseline | 0.98 | 0.95, 1.00 | 0.092 | |||
Inventory of Drug Use Consequences (InDue) total score - Baseline | 1.06 | 1.02, 1.12 | 0.009 | |||
1 OR = Odds Ratio, CI = Confidence Interval |
See sjplot
package, https://strengejacke.github.io/sjPlot/articles/plot_model_estimates.html
library(sjPlot)
sjPlot::plot_model(glm3, vline.color = "red")
Notice that scale is a log scale on X-axis, see https://andrewpwheeler.com/2013/10/26/odds-ratios-need-to-be-graphed-on-log-scales/
Other articles to explore - packages for presenting model summaries:
Load HELP dataset 2 (the other random half of the original HELP dataset).
load("help_set2_wide.RData")
h1test <- helpdat.set2 %>%
dplyr::select(homeless, female, pcs, pss_fr, indtot)
h1test <- h1test %>%
mutate(female.f = haven::as_factor(female),
homeless.f = haven::as_factor(homeless))
h2test <- h1test %>%
select(homeless.f, female.f, pss_fr, pcs, indtot)
library(ROCR)
p <- predict(glm3, newdata=h2test,
type="response")
pr <- prediction(p, as.numeric(h2test$homeless.f))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#abline(0, 1, col="red")
auc3test <- performance(pr, measure = "auc")
auc3test <- auc3test@y.values[[1]]
auc3test
## [1] 0.6269753
plot(prf,
main = paste("ROC Curve, AUC = ", round(auc3test, 3)))
abline(0, 1, col="red")
# print AUC for model using train and test data
paste0("AUC with training data ", round(auc3, digits = 3))
## [1] "AUC with training data 0.683"
paste0("AUC with test data ", round(auc3test, digits = 3))
## [1] "AUC with test data 0.627"