Missing Data

Lesson 10: An Introduction and Discussion of Missing Data

This lesson introduces you to missing data:

how to find, identify and review missing data
how to visualize missing data
types of missing data mechanisms
what to do next

Missing data - Mechanisms

Missing Completely at Random MCAR
- For data to be MCAR means that the data that are missing do not depend on (are not associated with) any of the data values. The caution with this definition is that the data might still be missing due to some unobserved (unmeasured) variable. We can’t check this directly, since that variable was never observed (we can’t quantify an unknown).
Missing at Random MAR
- For data to be MAR means that the missing data are associated with another observed variable or variables but NOT with the variable’s own value. If we include these associated variables in our models, we are “adjusting for” or “accounting for” the missing data mechanism. I’ve also heard this referred to as, or related to, “covariate-dependent missingness”; re: Don Hedeker http://www.caldar.org/presentations/summer%20institute/2015/Day2/Hedeker.pdf and http://bstt513.class.uic.edu/index.html - see Chapter 14 on “Missing Data”
Not Missing at Random NMAR (sometimes MNAR)
- For data to be NMAR means that they are neither MCAR nor MAR. The missing data in question depend on the item/subject being measured, that is, on the value that would have been recorded. For example, a good pulse oxygen measurement may be missing because it is increasingly difficult to get a reliable measurement if the patient is crashing.
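
To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is invented for illustration: the covariate `x`, the outcome `y`, the missingness probabilities, and the cutoffs are assumptions, not values from any real study. It simply compares the mean of the observed `y` values with the mean of the full data under each mechanism.

```python
import random
import statistics

random.seed(10)  # fixed seed so the sketch is repeatable

n = 2000
x = [random.gauss(0, 1) for _ in range(n)]      # fully observed covariate
y = [2 * xi + random.gauss(0, 1) for xi in x]   # outcome, related to x


def observed_mean(values, is_missing):
    """Mean of the values that were not flagged as missing."""
    return statistics.mean(v for v, m in zip(values, is_missing) if not m)


# MCAR: every y has the same 30% chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]

# MAR: missingness depends only on the observed covariate x.
mar = [random.random() < (0.8 if xi > 0 else 0.1) for xi in x]

# NMAR: missingness depends on the (unobserved) value of y itself.
nmar = [random.random() < (0.8 if yi > 1 else 0.0) for yi in y]

mean_full = statistics.mean(y)
mean_mcar = observed_mean(y, mcar)
mean_mar = observed_mean(y, mar)
mean_nmar = observed_mean(y, nmar)

print("full data :", round(mean_full, 2))
print("MCAR      :", round(mean_mcar, 2))
print("MAR       :", round(mean_mar, 2))
print("NMAR      :", round(mean_nmar, 2))
```

In this setup the MCAR mean should land close to the full-data mean, while the unadjusted MAR and NMAR means are pulled downward. The difference between the last two is that under MAR the variable driving the missingness (`x`) is in the dataset and can be adjusted for; under NMAR it is not.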

Practical Advice

Focus initially on your key outcomes (Y’s of interest)
Run code to create an indicator variable for each outcome, coded 0 if the value is NOT missing and 1 if the value is missing. For example:

Y1	Y1miss
12	0
13	0
	1
34	0
	1
	1
10	0

Use the newly created indicator variables (Y1miss, Y2miss, etc.) to run association tests, such as correlations, t-tests, and chi-square tests, to see whether the missingness is associated with any of your other predictors or outcomes (i.e., with anything in the rest of your dataset).
Depending on what these association tests show, you can decide whether to proceed with:

a complete case analysis (assuming MCAR)
adding the covariates necessary to adjust for/explain the missing data patterns
seeking additional statistical advice, if the data appear to be NMAR
considering multiple imputation (MI) or maximum likelihood (ML) estimation options for your modeling.
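
The indicator-and-test steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended workflow. The `y1` values come from the small Y1/Y1miss table; the `age` covariate is hypothetical and made up purely so there is something to test against. The helper `welch_t` computes only the Welch t statistic; for a p-value you would normally turn to a statistics package (for example, `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
import statistics

# Y1 values from the table above; None marks a missing response.
y1 = [12, 13, None, 34, None, None, 10]

# Indicator variable: 1 when Y1 is missing, 0 when it is observed.
y1miss = [1 if value is None else 0 for value in y1]
print(y1miss)  # [0, 0, 1, 0, 1, 1, 0]

# Hypothetical covariate, invented for this illustration only.
age = [34, 41, 67, 29, 72, 65, 38]

age_when_missing = [a for a, m in zip(age, y1miss) if m == 1]
age_when_observed = [a for a, m in zip(age, y1miss) if m == 0]


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


t_stat = welch_t(age_when_missing, age_when_observed)
print("mean age when Y1 missing :", statistics.mean(age_when_missing))
print("mean age when Y1 observed:", statistics.mean(age_when_observed))
print("Welch t statistic        :", round(t_stat, 2))
```

A large t statistic here would suggest that missingness in Y1 is associated with age, which argues against MCAR and for including age as a covariate.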

Keep in mind that sometimes you have to consider the impact of intermittent missing data across multiple variables. For example, suppose you have 5 variables and each one is missing only a few responses. If you put all 5 together in a model, the default in nearly all statistical software (at least without setting alternate options) is to treat the missing data “listwise” and remove every row/subject with missing data on any of these 5 variables. So you could easily end up with a much larger missing data impact than you originally expected. The same advice applies, however.
- create an indicator variable to tell you if there is missing data for that row/subject for any of these 5 variables - you could even count the number of variables with missing data for each individual row/subject. For example:

var1	var2	var3	var4	var5	nmiss	missingYN
12	w		3	1	1	1
13	b	55	2	2	0	0
	b	56	1	2	1	1
34		87	2	1	1	1
	w	88	1	2	1	1
15	w		3		2	1
10	b	90	2	1	0	0

- then use the same approach described above to see if this indicator variable is associated with anything else in your dataset.
- As you can tell, “listwise” deletion could potentially “eat up” and remove much of your data, undermining your sample size and the representativeness of your underlying data. Hopefully, this won’t be a problem. If it is, you may have to seek alternative variables or approaches to analyzing your data.
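
Here is a minimal sketch of the row-level bookkeeping described above, using the 5-variable example table. It uses plain Python lists, with `None` standing in for a missing cell (an assumption about how the data were read in), and it counts how many rows would survive listwise deletion.

```python
columns = ["var1", "var2", "var3", "var4", "var5"]

# Rows from the example table; None marks a missing cell.
rows = [
    (12, "w", None, 3, 1),
    (13, "b", 55, 2, 2),
    (None, "b", 56, 1, 2),
    (34, None, 87, 2, 1),
    (None, "w", 88, 1, 2),
    (15, "w", None, 3, None),
    (10, "b", 90, 2, 1),
]

# Number of missing variables per row, and a yes/no flag for "any missing".
nmiss = [sum(value is None for value in row) for row in rows]
missing_yn = [1 if count > 0 else 0 for count in nmiss]

# Missing count per variable, which looks harmless taken one at a time.
per_variable_missing = {
    name: sum(row[i] is None for row in rows)
    for i, name in enumerate(columns)
}

# Rows that survive listwise deletion (no missing values at all).
n_complete = missing_yn.count(0)

print(nmiss)        # [1, 0, 1, 1, 1, 2, 0]
print(missing_yn)   # [1, 0, 1, 1, 1, 1, 0]
print(per_variable_missing)
print(f"{n_complete} of {len(rows)} rows are complete cases")
```

No single variable is missing more than 2 of 7 values, yet only 2 of the 7 rows are complete, which is the listwise “eat up” effect in miniature.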

SIDE NOTE: Later this semester, we will briefly discuss mean substitution as a possible option in the context of people skipping items on a given survey instrument. This is a common practice for some instruments and has been built into the underlying psychometric properties of those measurement tools. However, mean substitution is NOT RECOMMENDED, as it is a BIASED method.
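
One well-known way this bias shows up is in the variance. The sketch below reuses the Y1 values from the earlier example and fills each missing value with the observed mean. It is only an illustration of that one symptom, not a full account of the problems with mean substitution.

```python
import statistics

# Y1 values from the earlier example; None marks a missing response.
y1 = [12, 13, None, 34, None, None, 10]

observed = [value for value in y1 if value is not None]
observed_mean = statistics.mean(observed)

# Mean substitution: every missing value is replaced by the observed mean.
imputed = [observed_mean if value is None else value for value in y1]

var_observed = statistics.variance(observed)  # sample variance, n = 4
var_imputed = statistics.variance(imputed)    # sample variance, n = 7

print("mean (observed only):", observed_mean)
print("mean (after filling):", statistics.mean(imputed))
print("variance (observed only):", var_observed)
print("variance (after filling):", var_imputed)
```

The mean does not change, but the filled-in values add no spread while inflating the sample size, so the variance shrinks. Standard errors and tests computed from the filled-in data will then look more precise than they should.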

Missing data - helpful references

KEY BOOK: “Statistical Analysis with Missing Data, 2nd Edition” by Roderick J. A. Little, Donald B. Rubin http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183865.html

ALSO:

Check out what is provided on the “Quick-R” website http://www.statmethods.net/input/missingdata.html
The Manning Book website for “R in Action” https://www.manning.com/books/r-in-action-second-edition?a_bid=5c2b1e1d&a_aid=RiA2ed Chapter 18 deals with Missing Data. NOTE: if you purchase the book you can then access the entire book content online.
Visualizing Missing Data in SAS https://blogs.sas.com/content/iml/2016/04/20/visualize-missing-data-sas.html
Examine patterns of missing data in SAS https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html
More on Visualizing Missing Data (based on John Fox’s Applied Regression book) http://scs.math.yorku.ca/index.php/Visualizing_missing_data - includes SAS code and macros
Potential “solutions” for missing data - Multiple Imputation (MI) and Maximum Likelihood (ML) http://www.theanalysisfactor.com/missing-data-two-recommended-solutions/
Paul Allison’s website (might get warning about unsafe website, not sure why) Discussion of MI versus ML https://m.statisticalhorizons.com/?url=https%3A%2F%2Fstatisticalhorizons.com%2Fml-better-than-mi&width=412

Melinda K. Higgins, PhD.

September 27, 2017
