Univariate Descriptive Stats
This lesson focuses on describing and summarizing your data well. We will cover not only descriptive statistics appropriate for each type of variable, but other considerations such as sample size and missing data which may vary for each measurement and variable in your dataset. Try to think of everything you want to cover in the methods section of your manuscript (project report) and the beginning of your results section. The stats computed here will be used to help layout “Table 1” which typically summarizes the data used in all analyses within the manuscript/report.
The HELP Dataset
For today and several future lessons, we’ll be working with the HELP (Health Evaluation and Linkage to Primary Care) dataset. You can learn more about this dataset at https://nhorton.people.amherst.edu/sasr2/datasets.php. This dataset is also used by Ken Kleinman and Nicholas J. Horton for their book “SAS and R: Data Management, Statistical Analysis, and Graphics” (which is another helpful textbook).
Descriptive Stats
When putting together your descriptive statistical summaries, it is usually easiest to summarize continuous/numerical data separately from categorical/ordinal data types.
Continuous/Numerical Data
- Sample Size
- Amount of Missing Data
- Range [min, max]
- Measure of Centrality - mean (parametric) or median (non-parametric)
- Measure of Variance (spread) - standard deviation (parametric), IQR [25th, 75th] or other percentiles [2.5th, 97.5th] (non-parametric)
- Distribution - normal, symmetric, skewed, multi-modal, etc.
- use to help assess parametric or non-parametric stats
- determine if/how to mathematical transform the data (sqrt, LN, 1/x, others…)
- determine if/how to (perhaps) recode
- median split,
- zero-inflated (or ceiling-inflated) splits,
- ordinalize the data (break into thirds, quartiles, etc)
- break along clinical cutpoints
Categorical/Ordinal Data
- Sample Size
- Amount of Missing Data
- Range [min, max]
- Review frequency tables (counts and percents)
- look for “distribution” of large % or low % (>95%, <5%)
- look for counts (expected counts < 5)
- may want to collapse categories together
- might consider dichotomizing the categories
- married/partnered vs single/divorced/widowed
- caucasian vs minorities
- heterosexual vs homosexual/bisexual/transexual/other
- may want to drop categories
- (does it make sense to merge 2 subjects without disease but with a rare condition with other subjects without the disease) - or if this rare condition may be important to the outcome(s) of interest and is under-represented, these 2 subjects might be removed until a future study can investigate this issue further
R Code
Run examples shown in lesson07_Rcode.R
SAS Code
Run examples shown in lesson07_SAScode.sas