cesd: "Center for Epidemiological Studies-Depression (CESD) total score - Baseline"
indtot: "Inventory of Drug Use Consequences (InDUC) total score - Baseline"

State whether you transformed cesd or indtot, and if so, what transformation you applied and why. You can also decide not to transform (i.e., weigh the tradeoffs between model fit and the interpretability of your results) - discuss your reasoning.

Here is the code and output after running a simple linear regression using the lm() function.
library(tidyverse)
library(haven)
helpdat <- haven::read_spss("helpmkh.sav")
# create subset
# select indtot, cesd and racegrp
h1 <- helpdat %>%
select(indtot, cesd, racegrp)
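Before modeling, it can help to take a quick look at the subset. This optional sanity check is not part of the original output; it just confirms the three variables came through as expected.

# optional: quick look at the subset (tidyverse/dplyr is loaded above)
dplyr::glimpse(h1)
summary(h1)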
# run simple linear regression
# using the lm
# save the results in the fit1 object
fit1 <- lm(cesd ~ indtot, data=h1)
# look at a summary() of the model
summary(fit1)
##
## Call:
## lm(formula = cesd ~ indtot, data = h1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.356 -7.658 0.644 8.057 30.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.86597 2.82788 4.196 3.27e-05 ***
## indtot 0.58725 0.07761 7.566 2.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.8 on 451 degrees of freedom
## Multiple R-squared: 0.1126, Adjusted R-squared: 0.1107
## F-statistic: 57.25 on 1 and 451 DF, p-value: 2.176e-13
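As an optional extra (not shown in the original output), confidence intervals for the coefficients can be pulled directly from the fitted model:

# 95% confidence intervals for the intercept and slope
confint(fit1)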
I probably would not do a transformation since the residuals look fairly normal (see diagnostic plots below), but you could run the spreadLevelPlot() function from the car package to see if a power transformation is suggested. It suggests a power transformation of about 1.64, which could be rounded up to 2. This is optional and not needed for these data.
library(car)
# look at the spreadLevelPlot
# this also provides a suggestion of
# possible power transformation
car::spreadLevelPlot(fit1)
##
## Suggested power transformation: 1.63886
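If you did want to see what the suggested transformation does, a quick sketch is to square the outcome and refit; fit1_sq below is a hypothetical object name used only for this illustration, and the squared scale makes the coefficients harder to interpret, which is why no transformation is used above.

# illustrative only: apply roughly the suggested power (~2) to cesd and refit
fit1_sq <- lm(I(cesd^2) ~ indtot, data = h1)
summary(fit1_sq)$adj.r.squared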
In addition, the gvlma() function from the gvlma package can be run as a global check of the linear model assumptions, which also all look OK here.
# global test of linear model assumptions
# load the gvlma package (install it first if needed)
library(gvlma)
gvmodel <- gvlma::gvlma(fit1)
summary(gvmodel)
##
## Call:
## lm(formula = cesd ~ indtot, data = h1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.356 -7.658 0.644 8.057 30.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.86597 2.82788 4.196 3.27e-05 ***
## indtot 0.58725 0.07761 7.566 2.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.8 on 451 degrees of freedom
## Multiple R-squared: 0.1126, Adjusted R-squared: 0.1107
## F-statistic: 57.25 on 1 and 451 DF, p-value: 2.176e-13
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = fit1)
##
## Value p-value Decision
## Global Stat 6.65298 0.1554 Assumptions acceptable.
## Skewness 2.17129 0.1406 Assumptions acceptable.
## Kurtosis 2.89772 0.0887 Assumptions acceptable.
## Link Function 1.52631 0.2167 Assumptions acceptable.
## Heteroscedasticity 0.05766 0.8102 Assumptions acceptable.
A good set of 4 diagnostic plots can be obtained by calling the plot() function on the fitted model object. These 4 diagnostic plots show: residuals vs. fitted values, a normal Q-Q plot of the residuals, the scale-location plot, and residuals vs. leverage (with Cook's distance contours).
# get diagnostic plots
par(mfrow=c(2,2))
plot(fit1)
# reset par
par(mfrow=c(1,1))
The histogram of the residuals looks approximately normal, with no marked skewness and no obvious outliers.
# histogram of the residuals
hist(fit1$residuals)
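To complement the visual check, a normal Q-Q plot and a Shapiro-Wilk test of the residuals can also be run. This is an optional addition, not part of the original analysis.

# optional formal check of residual normality
qqnorm(fit1$residuals)
qqline(fit1$residuals)
shapiro.test(fit1$residuals)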
There is quite a bit of scatter in the plot, with a lot of variability in both indtot and cesd, so a linear fit line looks as good as any - there is no obvious curvature in the data.
plot(h1$indtot, h1$cesd)
abline(lm(h1$cesd ~ h1$indtot))
# using a ggplot2 approach
ggplot(h1, aes(indtot, cesd)) +
geom_point() +
stat_smooth(method = lm)
library(knitr)
library(xtable)
print(xtable(summary(fit1)), type = "html")
|             | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|----------|------------|---------|----------|
| (Intercept) | 11.8660  | 2.8279     | 4.20    | 0.0000   |
| indtot      | 0.5873   | 0.0776     | 7.57    | 0.0000   |
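Since knitr is already loaded, kable() is a simpler alternative for formatting the same coefficient table; this is just an alternative presentation, not part of the original output.

# alternative: format the coefficient table with knitr::kable()
knitr::kable(summary(fit1)$coefficients, digits = 4)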
The variability explained by this model is about 11.1%, which is the adjusted R-squared captured as summary(fit1)$adj.r.squared.

For indtot scores equal to 0, the model estimates that a subject would have a cesd score of about 11.87 (the y-intercept). For each 1-point increase in indtot, the cesd score increases on average by about 0.587, based on the slope estimate.

It is worth noting that, given the wide variability in both indtot and cesd, the linear fit line indicates a weak positive correlation between these 2 variables; the residuals show wide variation about the fitted line, indicating that the linear trend is weak at best.
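To make the interpretation concrete, predict() can be used to get the model's estimated cesd for a chosen indtot value; the value 40 below is an arbitrary in-range example, not from the original write-up.

# predicted mean cesd (with 95% CI) for a hypothetical indtot score of 40
predict(fit1, newdata = data.frame(indtot = 40), interval = "confidence")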
cesd: "Center for Epidemiological Studies-Depression (CESD) total score - Baseline"
racegrp: "Racial Group of Respondent"

# one-way ANOVA
# we can use the lm() function
# it does "dummy" coding on the fly
# run racegrp as either the character
# type or as a factor - either will work
fit2.lm <- lm(cesd ~ racegrp, data=h1)
summary(fit2.lm)
##
## Call:
## lm(formula = cesd ~ racegrp, data = h1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.3012 -8.3012 0.8199 8.6400 27.8199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1801 0.8468 35.642 < 2e-16 ***
## racegrphispanic 4.1799 1.9346 2.161 0.0313 *
## racegrpother 5.7430 2.5565 2.246 0.0252 *
## racegrpwhite 5.1211 1.2761 4.013 7.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.3 on 449 degrees of freedom
## Multiple R-squared: 0.04042, Adjusted R-squared: 0.03401
## F-statistic: 6.304 on 3 and 449 DF, p-value: 0.0003396
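For reference, the reference level in the dummy coding above is "black" (the first level alphabetically), so each coefficient is a difference from the black group. The sketch below shows how you could relevel to a different reference group if desired; racegrp_f is a hypothetical helper column added only for this illustration.

# optional: change the reference group used for the dummy coding
h1$racegrp_f <- relevel(factor(h1$racegrp), ref = "white")
summary(lm(cesd ~ racegrp_f, data = h1))$coefficients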
# the aov() function
# gives the global test for the "group" effect
fit2.aov <- aov(cesd ~ racegrp, data=h1)
summary(fit2.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## racegrp 3 2861 953.7 6.304 0.00034 ***
## Residuals 449 67927 151.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, racegrp is significant - the global F-test indicates that there are significant differences in mean cesd scores among the 4 racial groups.
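A quick table of group sizes, means, and standard deviations helps put the global F-test in context. This is a sketch that assumes cesd and racegrp have no missing values in h1 (otherwise add na.rm = TRUE).

# group sizes, means, and SDs for cesd by racegrp
h1 %>%
  group_by(racegrp) %>%
  summarise(n = n(),
            mean_cesd = mean(cesd),
            sd_cesd = sd(cesd))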
Since racegrp was significant, let's run all of the pairwise comparisons. The code below will show the options for adjusting the error rate due to multiple pairwise comparisons, including Bonferroni, Holm, and Tukey HSD.
# post hoc tests
# Tukey HSD
TukeyHSD(fit2.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cesd ~ racegrp, data = h1)
##
## $racegrp
## diff lwr upr p adj
## hispanic-black 4.1799052 -0.8088317 9.168642 0.1359780
## other-black 5.7429821 -0.8494072 12.335372 0.1125093
## white-black 5.1211100 1.8305336 8.411686 0.0004071
## other-hispanic 1.5630769 -6.1057988 9.231953 0.9528809
## white-hispanic 0.9412048 -4.1754298 6.057839 0.9647000
## white-other -0.6218721 -7.3115695 6.067825 0.9951504
# using Bonferroni error-rate correction
pairwise.t.test(h1$cesd, h1$racegrp, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: h1$cesd and h1$racegrp
##
## black hispanic other
## hispanic 0.18753 - -
## other 0.15097 1.00000 -
## white 0.00042 1.00000 1.00000
##
## P value adjustment method: bonferroni
# using the Holm error-rate correction
pairwise.t.test(h1$cesd, h1$racegrp, p.adj = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: h1$cesd and h1$racegrp
##
## black hispanic other
## hispanic 0.12581 - -
## other 0.12581 1.00000 -
## white 0.00042 1.00000 1.00000
##
## P value adjustment method: holm
These pairwise comparisons show a significant difference between the white and black groups, but none of the other pairwise comparisons were significant.
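The Tukey HSD results can also be plotted; intervals that cross zero correspond to non-significant comparisons, so only the white-black interval should sit clearly away from zero. This plot is an optional addition.

# plot the Tukey HSD family-wise confidence intervals
plot(TukeyHSD(fit2.aov))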
# Bartlett's test for homogeneity of variances
# note: uses the same formula interface as lm() and aov()
bartlett.test(cesd ~ racegrp, data=h1)
##
## Bartlett test of homogeneity of variances
##
## data: cesd by racegrp
## Bartlett's K-squared = 3.4367, df = 3, p-value = 0.3291
This was not significant, so we can report the usual F-statistic test for racegrp rather than the "robust" Welch's F-test.
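For completeness, the Welch test mentioned above is available via oneway.test(); it is shown here only as a sketch of what we would have reported had Bartlett's test been significant.

# Welch's robust one-way test (var.equal = FALSE is the default)
oneway.test(cesd ~ racegrp, data = h1)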
Is the group effect (racegrp) significant?

# get a means plot using
# plotmeans() from gplots package
library(gplots)
gplots::plotmeans(cesd ~ racegrp,
data=h1)
Yes, racegrp was significant overall. Consistent with the pairwise comparisons above, the only clear separation in the means plot is between the white and black groups; the other pairwise differences were not significant.
print(xtable(summary(fit2.aov)),
type = "html")
|           | Df  | Sum Sq   | Mean Sq | F value | Pr(>F) |
|-----------|-----|----------|---------|---------|--------|
| racegrp   | 3   | 2861.03  | 953.68  | 6.30    | 0.0003 |
| Residuals | 449 | 67927.46 | 151.29  |         |        |
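As with the regression table, knitr::kable() is a simpler alternative for formatting the ANOVA table; this is just an alternative presentation, not part of the original output.

# alternative: format the ANOVA table with knitr::kable()
knitr::kable(summary(fit2.aov)[[1]], digits = 2)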