N741 Spring 2018 - Homework 6

Homework 6

Background and Information on HELP Dataset

For homework 6, you will be working with the HELP (Health Evaluation and Linkage to Primary Care) Dataset.

The HELP Dataset:

You can learn more about the HELP (Health Evaluation and Linkage to Primary Care) dataset, and download the data files, at https://nhorton.people.amherst.edu/sasr2/datasets.php. Ken Kleinman and Nicholas J. Horton also use this dataset in their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (another helpful textbook).
The original publication is referenced at https://www.ncbi.nlm.nih.gov/pubmed/12653820?ordinalpos=17&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum
The HELP documentation (including all forms/surveys/instruments used) is located at:
- https://nhorton.people.amherst.edu/help/
- specifically the details on all BASELINE assessments are located in this PDF https://nhorton.people.amherst.edu/help/HELP-baseline.pdf
- with the follow up time points described in the PDF https://nhorton.people.amherst.edu/help/HELP-followup.pdf

Summary of Entire HELP Dataset - Complete Codebook

See complete data descriptions and codebook at https://melindahiggins2000.github.io/N736Fall2017_HELPdataset/

Variables for Homework 6

For Homework 6, you will focus only on these variables from the HELP dataset:

Use these variables from the HELP dataset for Homework 6
Variable	Variable Label
age	Age at baseline (in years)
female	Gender of respondent
pss_fr	Perceived Social Support - friends
homeless	One or more nights on the street or shelter in past 6 months
pcs	SF36 Physical Composite Score - Baseline
mcs	SF36 Mental Composite Score - Baseline
cesd	CESD total score - Baseline

Homework 6 Assignment

SETUP: Download and run the “loadHELP.R” R script (included in this GitHub repo https://github.com/melindahiggins2000/N741Spring2018_Homework6) to read in the HELP Dataset “helpmkh.sav”. This script also pulls out the variables you need and creates the dichotomous depression variable, cesd_gte16, which you will need for the logistic regression.

After running this R script, you will have a data frame called h1 you can use to do the rest of your analyses. You can also copy this code into your first R markdown code chunk to get you started on Homework 6.

For Homework 6, you will be looking at depression in these subjects. First, you will run a model for the continuous depression measure, the CESD (Center for Epidemiologic Studies Depression Scale), which is a measure of depressive symptoms. Also see the APA details on the CESD at http://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/depression-scale.aspx. The CESD can be used to predict actual clinical depression, but it is not technically a diagnosis of depression. CESD scores range from 0 (no depressive symptoms) to 60 (most severe depressive symptoms). You will use the cesd variable to run a linear regression.
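As a minimal sketch of what a linear regression on a 0 to 60 score can look like, the R code below fits lm() to simulated data, not to the real HELP data. The data frame name sim, the seed, and every number used to generate the scores are made-up assumptions for illustration only; with the homework data you would work with h1 instead.

```r
# Simulated stand-in data (NOT the real HELP data); all values are hypothetical.
set.seed(741)
n <- 200
mcs <- runif(n, min = 10, max = 60)            # made-up mental composite scores
cesd <- 55 - 0.7 * mcs + rnorm(n, sd = 8)      # made-up linear relation plus noise
cesd <- pmin(pmax(round(cesd), 0), 60)         # keep scores inside the 0-60 CESD range
sim <- data.frame(mcs = mcs, cesd = cesd)

# Simple linear regression of cesd on mcs.
fit_lm <- lm(cesd ~ mcs, data = sim)

# Intercept and slope of the fitted line, and the proportion of variance explained.
coefs <- coef(fit_lm)
r2 <- summary(fit_lm)$r.squared

cat(sprintf("cesd = %.2f + (%.2f) * mcs\n", coefs["(Intercept)"], coefs["mcs"]))
cat(sprintf("R-squared = %.3f\n", r2))
```

Calling summary(fit_lm) prints the full coefficient table with standard errors and tests; the two cat() lines only pull out the pieces needed to write the fitted equation.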

The recommended threshold used to indicate potential clinical depression is a score of 16 or greater. You will then use the variable created with this cutoff (cesd_gte16) in a similar modeling approach to predict the probability of clinical depression (using logistic regression).
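The cutoff step can be sketched as follows, again on simulated data rather than the real HELP data. Here cesd_gte16 is built as a logical (TRUE/FALSE) variable, which is an assumption; the “loadHELP.R” script may code it differently. The data frame sim and the numbers used to simulate it are hypothetical.

```r
# Simulated stand-in data (NOT the real HELP data); all values are hypothetical.
set.seed(741)
n <- 300
mcs <- runif(n, min = 10, max = 60)
cesd <- pmin(pmax(round(55 - 0.7 * mcs + rnorm(n, sd = 8)), 0), 60)
sim <- data.frame(mcs = mcs, cesd = cesd)

# Scores of 16 or greater are flagged as potential clinical depression.
sim$cesd_gte16 <- sim$cesd >= 16

# Logistic regression of the dichotomous outcome on mcs.
fit_glm <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)

# Raw coefficients are on the log-odds scale; exponentiate to get odds ratios.
odds_ratios <- exp(coef(fit_glm))

print(table(sim$cesd_gte16))
print(round(odds_ratios, 3))
```

An odds ratio below 1 for mcs means that each one-point increase in mcs is associated with lower odds of scoring 16 or greater.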

Homework 6 Tasks

[Model 1] Run a simple linear regression (lm()) for cesd using the mcs variable, which is the mental component quality of life score from the SF36.
Write the equation of the final fitted model (i.e., what are the intercept and the slope?). Write a sentence describing the model results (interpret the intercept and slope). NOTE: The mcs values range from 0 to 100, where the population norm for “normal mental health quality of life” is considered to be 50. If you score higher than 50 on the mcs, your mental health is considered better than the population norm, and vice versa: if your mcs score is less than 50, your mental health is considered worse than the population norm.
How much variability in the cesd does the mcs explain? (what is the R²?) Write a sentence describing how well the mcs does in predicting the cesd.
[Model 2] Run a second linear regression model (lm()) for the cesd putting in all of the other variables:
- age
- female
- pss_fr
- homeless
- pcs
- mcs
Print out the model results with the coefficients, tests, and model fit statistics.
Which variables are significant in the model? Write a sentence or two describing the impact of these variables for predicting depression scores (HINT: interpret the coefficient terms).
Following the example we did in class for the Prestige dataset https://cdn.rawgit.com/vhertzb/2018week9/2f2ea142/2018week9.html?raw=true, generate the diagnostic plots for this model with these 6 predictors (e.g., the residual plots by variable, the added-variable plots, the Q-Q plot, and the other diagnostic plots). Also compute the VIFs to check for multicollinearity issues.
[Model 3] Repeat Model 1 above, except this time run a logistic regression (glm()) to predict CESD scores >= 16 (using cesd_gte16 as the outcome) as a function of mcs scores. Show a summary of the final fitted model and explain the coefficients. [REMEMBER to compute the Odds Ratios after you get the raw coefficients (betas)].
Use the predict() function like we did in class to predict CESD >= 16 and compare the predictions back to the original data. For now, use a cutoff probability of 0.5: if the predicted probability is > 0.5, consider the prediction to be true, and false otherwise. REMEMBER: see the R code for the class example at https://github.com/melindahiggins2000/N741_lecture11_27March2018/blob/master/lesson11_logreg_Rcode.R
- How well did the model correctly predict CESD scores >= 16 (indicating depression)? (Make the “confusion matrix” and look at the true positives and true negatives versus the false positives and false negatives.)
Make an ROC curve plot, compute the AUC, and explain whether or not this is a good model for predicting depression.
Make a plot showing the probability curve - put the mcs values on the X-axis and the probability of depression on the Y-axis. Based on this plot, do you think the mcs is a good predictor of depression? [FYI This plot is also called an “effect plot” if you’re using Rcmdr to do these analyses.]
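The prediction, confusion matrix, AUC, and probability curve steps can be sketched in base R as shown below. This is an illustration on simulated data, not the class code and not the real HELP data: the data frame sim and all simulated numbers are hypothetical, and the AUC is computed with the rank-sum (Mann-Whitney) identity as a base-R stand-in for whatever ROC package was used in class.

```r
# Simulated stand-in data (NOT the real HELP data); all values are hypothetical.
set.seed(741)
n <- 300
mcs <- runif(n, min = 10, max = 60)
cesd <- pmin(pmax(round(55 - 0.7 * mcs + rnorm(n, sd = 8)), 0), 60)
sim <- data.frame(mcs = mcs, cesd_gte16 = cesd >= 16)

fit_glm <- glm(cesd_gte16 ~ mcs, data = sim, family = binomial)

# Predicted probabilities, classified as TRUE when the probability is > 0.5.
prob <- predict(fit_glm, newdata = sim, type = "response")
pred <- prob > 0.5

# Confusion matrix: rows are predictions, columns are observed values.
conf_mat <- table(
  predicted = factor(pred, levels = c(FALSE, TRUE)),
  observed  = factor(sim$cesd_gte16, levels = c(FALSE, TRUE))
)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# AUC from the rank-sum identity: the probability that a randomly chosen
# positive case gets a higher predicted probability than a negative case.
r <- rank(prob)
n_pos <- sum(sim$cesd_gte16)
n_neg <- sum(!sim$cesd_gte16)
auc <- (sum(r[sim$cesd_gte16]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Probability curve: predicted probability across the 0-100 mcs range.
grid <- data.frame(mcs = seq(0, 100, by = 1))
grid$prob <- predict(fit_glm, newdata = grid, type = "response")

print(conf_mat)
cat(sprintf("accuracy = %.3f, AUC = %.3f\n", accuracy, auc))

# Draw the curve only in an interactive session.
if (interactive()) {
  plot(grid$mcs, grid$prob, type = "l",
       xlab = "mcs", ylab = "Predicted probability of CESD >= 16")
}
```

In the confusion matrix, the diagonal cells are the true negatives and true positives, and the off-diagonal cells are the false negatives and false positives.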

Use R Markdown to complete your homework and show all of your code and output in your final report. Turn in a PDF of your report to Canvas, and include a link to your GitHub repo for Homework 6.