For homework 6, you will be working with the HELP (Health Evaluation and Linkage to Primary Care) Dataset.
The HELP Dataset:
You can learn more about the HELP (Health Evaluation and Linkage to Primary Care) dataset, and download the datasets, at https://nhorton.people.amherst.edu/sasr2/datasets.php. This dataset is also used by Ken Kleinman and Nicholas J. Horton in their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (which is another helpful textbook).
The original publication is referenced at https://www.ncbi.nlm.nih.gov/pubmed/12653820?ordinalpos=17&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum
See complete data descriptions and codebook at https://melindahiggins2000.github.io/N736Fall2017_HELPdataset/
For Homework 6, you will focus only on these variables from the HELP dataset:
Variable | Label |
---|---|
age | Age at baseline (in years) |
female | Gender of respondent |
pss_fr | Perceived Social Support - friends |
homeless | One or more nights on the street or shelter in past 6 months |
pcs | SF36 Physical Composite Score - Baseline |
mcs | SF36 Mental Composite Score - Baseline |
cesd | CESD total score - Baseline |
SETUP: Download and run the “loadHELP.R” R script (included in this Github repo https://github.com/melindahiggins2000/N741Spring2018_Homework6) to read in the HELP Dataset “helpmkh.sav”. This script also pulls out the variables you need and creates the dichotomous variable for depression, `cesd_gte16`, which you will need for the logistic regression.
After running this R script, you will have a data frame called `h1` that you can use to do the rest of your analyses. You can also copy this code into your first R markdown code chunk to get you started on Homework 6.
For Homework 6, you will be looking at depression in these subjects. First, you will run a model for the continuous depression measure, the CESD (Center for Epidemiologic Studies Depression Scale), which is a measure of depressive symptoms. Also see the APA details on the CESD at http://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/depression-scale.aspx. The CESD can be used to predict actual clinical depression, but it is not technically a diagnosis of depression. CESD scores range from 0 (no depressive symptoms) to 60 (most severe depressive symptoms). You will use the `cesd` variable to run a linear regression.
The recommended threshold used to indicate potential clinical depression is a score of 16 or greater. You will then use the variable created using this cutoff (`cesd_gte16`) to perform a similar modeling approach with the variables to predict the probability of clinical depression (using logistic regression).
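As a small illustration of the cutoff described above, the sketch below dichotomizes a handful of made-up CESD scores at 16. The data frame `h1_sim` and its values are hypothetical stand-ins for `h1`, and the 0/1 coding is an assumption; the “loadHELP.R” script already creates `cesd_gte16` for you and may code it differently.

```r
# Hypothetical stand-in for a few rows of h1 (made-up scores, not HELP data).
h1_sim <- data.frame(cesd = c(3, 15, 16, 28, 49))

# Flag scores at or above the recommended cutoff of 16.
# Assumption: 1 = at/above cutoff, 0 = below; loadHELP.R may code this differently.
h1_sim$cesd_gte16 <- as.numeric(h1_sim$cesd >= 16)

h1_sim
table(h1_sim$cesd_gte16)
```

Note that a score of exactly 16 falls in the “potential clinical depression” group because the cutoff is “16 or greater”.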
[Model 1] Run a simple linear regression (`lm()`) for `cesd` using the `mcs` variable, which is the mental component quality of life score from the SF36.
Write the equation of the final fitted model (i.e., what are the intercept and the slope?). Write a sentence describing the model results (interpret the intercept and slope). NOTE: The `mcs` values range from 0 to 100, where the population norm for “normal mental health quality of life” is considered to be 50. If you score higher than 50 on the `mcs`, you have mental health better than the population norm, and vice versa: if your `mcs` score is less than 50, then your mental health is considered to be worse than the population norm.
How much variability in the `cesd` does the `mcs` explain (what is the R²)? Write a sentence describing how well the `mcs` does in predicting the `cesd`.
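The Model 1 steps above might be sketched roughly as follows. To keep the example runnable on its own, it fits the model to simulated data: the data frame `sim` and the numbers used to generate it are hypothetical, not HELP values. In your homework you would use `data = h1` instead.

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(mcs = runif(n, 5, 65))
sim$cesd <- 50 - 0.7 * sim$mcs + rnorm(n, sd = 8)

# Simple linear regression of cesd on mcs.
m1 <- lm(cesd ~ mcs, data = sim)
summary(m1)

# Intercept and slope, written out as the fitted equation.
b <- coef(m1)
cat(sprintf("cesd = %.2f + (%.2f) * mcs\n", b[["(Intercept)"]], b[["mcs"]]))

# Proportion of variability in cesd explained by mcs.
r2 <- summary(m1)$r.squared
r2
```

The intercept is the predicted `cesd` when `mcs` is 0, and the slope is the expected change in `cesd` for each one-point increase in `mcs`.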
[Model 2] Run a linear regression (`lm()`) for the `cesd`, putting in all of the other variables:
age
female
pss_fr
homeless
pcs
mcs
Print out the model results with the coefficients and tests and model fit statistics.
Which variables are significant in the model? Write a sentence or two describing the impact of these variables for predicting depression scores (HINT: interpret the coefficient terms).
Following the example we did in class for the Prestige dataset https://cdn.rawgit.com/vhertzb/2018week9/2f2ea142/2018week9.html?raw=true, generate the diagnostic plots for this model with these 6 predictors (e.g. get the residual plot by variables, the added-variable plots, the Q-Q plot, diagnostic plots). Also run the VIFs to check for multicollinearity issues.
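One way the Model 2 workflow could look is sketched below, using only base R and simulated data (the data frame `sim` and its generating values are hypothetical). The VIFs are computed by hand as 1 / (1 - R²) from regressing each predictor on the others. The `car` package offers `residualPlots()`, `avPlots()`, `qqPlot()`, and `vif()` for the same purposes; whether the class example used exactly those functions is an assumption.

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(
  age      = round(runif(n, 19, 60)),
  female   = rbinom(n, 1, 0.25),
  pss_fr   = round(runif(n, 0, 14)),
  homeless = rbinom(n, 1, 0.45),
  pcs      = rnorm(n, 48, 10),
  mcs      = runif(n, 5, 65)
)
sim$cesd <- 50 - 0.6 * sim$mcs - 0.2 * sim$pcs + 2 * sim$female + rnorm(n, sd = 8)

# Multiple linear regression with all six predictors.
m2 <- lm(cesd ~ age + female + pss_fr + homeless + pcs + mcs, data = sim)
summary(m2)  # coefficients, tests, and model fit statistics

# Base R diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage).
par(mfrow = c(2, 2))
plot(m2)
par(mfrow = c(1, 1))

# VIF by hand: regress each predictor on the others, then 1 / (1 - R^2).
predictors <- c("age", "female", "pss_fr", "homeless", "pcs", "mcs")
vifs <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)
```

A VIF near 1 means a predictor is nearly uncorrelated with the others; larger values point to possible multicollinearity.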
[Model 3] Repeat Model 1 above, except this time run a logistic regression (`glm()`) to predict CESD scores >= 16 (using `cesd_gte16` as the outcome) as a function of `mcs` scores. Show a summary of the final fitted model and explain the coefficients. [REMEMBER to compute the Odds Ratios after you get the raw coefficients (betas).]
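A minimal sketch of the logistic regression and odds ratio step, again on simulated data (the data frame `sim` and its generating values are hypothetical, and the 0/1 coding of `cesd_gte16` is an assumption):

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(mcs = runif(n, 5, 65))
sim$cesd <- 50 - 0.7 * sim$mcs + rnorm(n, sd = 8)
sim$cesd_gte16 <- as.numeric(sim$cesd >= 16)

# Logistic regression: probability of CESD >= 16 as a function of mcs.
m3 <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)
summary(m3)  # raw coefficients (betas) are on the log-odds scale

# Odds ratios: exponentiate the raw coefficients.
or <- exp(coef(m3))
or

# Wald 95% confidence intervals, also on the odds ratio scale.
exp(confint.default(m3))
```

An odds ratio below 1 for `mcs` means that the odds of CESD >= 16 go down as `mcs` goes up.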
Use the `predict()` function like we did in class to predict CESD >= 16 and compare it back to the original data. For now, use a cutoff probability of 0.5: if the probability is > 0.5, consider this to be true, and false otherwise. REMEMBER: See the R code for the class example at https://github.com/melindahiggins2000/N741_lecture11_27March2018/blob/master/lesson11_logreg_Rcode.R
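The prediction and comparison step could be sketched as below. This is an illustration on simulated data (the data frame `sim` is hypothetical) and is not necessarily identical to the class code linked above.

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(mcs = runif(n, 5, 65))
sim$cesd <- 50 - 0.7 * sim$mcs + rnorm(n, sd = 8)
sim$cesd_gte16 <- as.numeric(sim$cesd >= 16)

m3 <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)

# Predicted probabilities of CESD >= 16 for each subject.
p_hat <- predict(m3, newdata = sim, type = "response")

# Classify with a cutoff probability of 0.5.
pred_gte16 <- as.numeric(p_hat > 0.5)

# Cross-tabulate predictions against the original data.
conf_tab <- table(observed = sim$cesd_gte16, predicted = pred_gte16)
conf_tab

# Overall proportion classified correctly.
accuracy <- mean(pred_gte16 == sim$cesd_gte16)
accuracy
```

Using `type = "response"` matters here: without it, `predict()` returns values on the log-odds scale and not probabilities.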
Make an ROC curve plot, compute the AUC, and explain whether this is a good model for predicting depression.
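To show what the ROC curve and AUC measure, the sketch below builds both by hand in base R on simulated data (the data frame `sim` is hypothetical). The AUC uses the rank-based (Mann-Whitney) formula. Packages such as `pROC` or `ROCR` are commonly used for this as well; which one the class example used is not assumed here.

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(mcs = runif(n, 5, 65))
sim$cesd <- 50 - 0.7 * sim$mcs + rnorm(n, sd = 8)
sim$cesd_gte16 <- as.numeric(sim$cesd >= 16)

m3 <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)
p_hat <- predict(m3, type = "response")
y <- sim$cesd_gte16

# ROC curve: sensitivity and 1 - specificity at every possible cutoff.
cuts <- sort(unique(c(0, p_hat, 1)), decreasing = TRUE)
tpr <- sapply(cuts, function(cut) mean(p_hat[y == 1] >= cut))
fpr <- sapply(cuts, function(cut) mean(p_hat[y == 0] >= cut))

plot(fpr, tpr, type = "l",
     xlab = "1 - Specificity (false positive rate)",
     ylab = "Sensitivity (true positive rate)",
     main = "ROC curve (simulated data)")
abline(0, 1, lty = 2)  # reference line for a model no better than chance

# AUC via the rank-based (Mann-Whitney) formula.
r  <- rank(p_hat)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

An AUC of 0.5 corresponds to the dashed chance line, and values closer to 1 indicate better discrimination between the two groups.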
Make a plot showing the probability curve: put the `mcs` values on the X-axis and the probability of depression on the Y-axis. Based on this plot, do you think the `mcs` is a good predictor of depression? [FYI: This plot is also called an “effect plot” if you’re using `Rcmdr` to do these analyses.]
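A probability curve like the one described above can be drawn in base R by predicting over a grid of `mcs` values. The sketch uses simulated data (the data frame `sim` and the helper data frame `grid` are hypothetical names), so the shape of your own curve will differ.

```r
# Simulated stand-in for h1 (hypothetical values, for illustration only).
set.seed(741)
n <- 200
sim <- data.frame(mcs = runif(n, 5, 65))
sim$cesd <- 50 - 0.7 * sim$mcs + rnorm(n, sd = 8)
sim$cesd_gte16 <- as.numeric(sim$cesd >= 16)

m3 <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)

# Predicted probability over the observed range of mcs.
grid <- data.frame(mcs = seq(min(sim$mcs), max(sim$mcs), length.out = 100))
grid$p <- predict(m3, newdata = grid, type = "response")

# Observed 0/1 outcomes as points, fitted probability as a curve.
plot(sim$mcs, sim$cesd_gte16, pch = 16, col = "grey60",
     xlab = "mcs", ylab = "Probability of CESD >= 16",
     main = "Fitted probability curve (simulated data)")
lines(grid$mcs, grid$p, lwd = 2)
abline(h = 0.5, lty = 3)  # the 0.5 cutoff probability
```

A steep curve suggests that `mcs` separates the two groups well, while a nearly flat curve suggests it carries little information about depression status.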
Use R markdown to complete your homework and show all of your code and output in your final report. Turn in a PDF of your report to Canvas. Include a link to your Github repo for Homework 6.