Lesson 10: An Introduction and Discussion of Missing Data

This lesson introduces you to missing data:

Missing data - Mechanisms

Practical Advice

  1. Focus initially on your key outcomes (Y’s of interest)
  2. Run code to create an indicator variable for each outcome, coded 0 if the value is NOT missing and 1 if the value is missing. For example:
Y1 Y1miss
12 0
13 0
34 0
10 0
  3. Use the newly created indicator variables (Y1miss, Y2miss, etc.) to run association tests, such as correlations, t-tests, and chi-square tests, to see whether the missingness is associated with any of your other predictors or outcomes (i.e., with anything in the rest of your dataset).

  4. Depending on what you find in step 3, you can decide if you want to proceed with:

  5. Keep in mind that intermittent missing data across multiple variables can add up. For example, suppose you have 5 variables, each missing only a few responses. If you put all 5 together in a model, the default in nearly all statistical software (at least without setting alternate options) is to treat the missing data “listwise” and remove every row/subject with missing data on any of these 5 variables. So you could easily end up with much more missing data impact than you originally expected. The same advice applies, however:
    • create an indicator variable that tells you whether a row/subject has missing data on any of these 5 variables; you could even count the number of variables with missing data for each row/subject. For example:
var1 var2 var3 var4 var5 nmiss missingYN
12 w 3 1 1 1
13 b 55 2 2 0 0
b 56 1 2 1 1
34 87 2 1 1 1
w 88 1 2 1 1
15 w 3 2 1
10 b 90 2 1 0 0
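
The steps above can be sketched in a few lines of code. The example below is only an illustration in plain Python (standard library only), not the course's own code: the dataset, the variable names (y1, y2, age), and the helper welch_t are all hypothetical, and None is assumed to mark a missing value. It builds the 0/1 indicators from step 2, the nmiss and missingYN columns from step 5, counts how many rows survive listwise deletion, and computes a Welch t statistic as one possible association check for step 3 (the statistic only, no p-value).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy dataset (not from the lesson); None marks a missing value.
data = [
    {"y1": 12,   "y2": 3.1,  "age": 41},
    {"y1": 13,   "y2": 2.7,  "age": 38},
    {"y1": None, "y2": 3.9,  "age": 67},
    {"y1": 34,   "y2": None, "age": 45},
    {"y1": None, "y2": 4.2,  "age": 71},
    {"y1": 10,   "y2": 3.3,  "age": 35},
    {"y1": 15,   "y2": None, "age": 52},
    {"y1": None, "y2": 2.9,  "age": 64},
]
outcomes = ["y1", "y2"]

for row in data:
    # Step 2: one indicator per outcome (0 = observed, 1 = missing).
    for name in outcomes:
        row[name + "miss"] = 1 if row[name] is None else 0
    # Step 5: number of missing outcomes in this row, plus an any-missing flag.
    row["nmiss"] = sum(row[name + "miss"] for name in outcomes)
    row["missingYN"] = 1 if row["nmiss"] > 0 else 0

# Listwise deletion keeps only rows with no missing value on any outcome.
complete_rows = [row for row in data if row["missingYN"] == 0]


def welch_t(a, b):
    """Welch's t statistic for two independent samples (statistic only)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Step 3: compare age between rows where y1 is missing and rows where it is observed.
age_missing = [row["age"] for row in data if row["y1miss"] == 1]
age_observed = [row["age"] for row in data if row["y1miss"] == 0]
t_stat = welch_t(age_missing, age_observed)

print("rows kept under listwise deletion:", len(complete_rows), "of", len(data))
print("Welch t for age by y1 missingness:", round(t_stat, 2))
```

In this toy data each outcome is missing in only two or three rows, yet fewer than half of the rows survive listwise deletion, which is the compounding effect described in step 5. In practice you would usually run the association tests in your statistical package (R, SAS, etc.) rather than by hand.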

SIDE NOTE: Later this semester we will briefly discuss mean substitution as a possible option when people skip items on a given survey instrument. This is a common practice for some instruments and has been built into the underlying psychometric properties of those measurement tools. In general, however, mean substitution is NOT RECOMMENDED because it is a BIASED method.
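
One well-known way the bias shows up is that mean substitution leaves the mean unchanged but shrinks the variance, because every filled-in value sits exactly at the center of the distribution. The sketch below illustrates this with hypothetical scores (None marks a skipped item); it is a small demonstration in plain Python, not a full account of the bias.

```python
from statistics import mean, variance

# Hypothetical scores (not from the lesson); None marks a skipped item.
scores = [12, 13, None, 34, None, 10, 15]

observed = [s for s in scores if s is not None]
fill = mean(observed)

# Mean substitution: replace every missing value with the observed mean.
imputed = [fill if s is None else s for s in scores]

var_observed = variance(observed)  # sample variance of the observed values
var_imputed = variance(imputed)    # sample variance after mean substitution

print("mean (observed vs. imputed):", mean(observed), mean(imputed))
print("variance (observed vs. imputed):", var_observed, var_imputed)
```

The understated variance carries over into standard errors and correlations, which is one reason methods such as multiple imputation or maximum likelihood are usually preferred.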

Missing data - helpful references

KEY BOOK: “Statistical Analysis with Missing Data, 2nd Edition” by Roderick J. A. Little, Donald B. Rubin http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183865.html


  1. Check out what is provided on the “Quick-R” website http://www.statmethods.net/input/missingdata.html

  2. The Manning Book website for “R in Action” https://www.manning.com/books/r-in-action-second-edition?a_bid=5c2b1e1d&a_aid=RiA2ed Chapter 18 deals with Missing Data. NOTE: if you purchase the book you can then access the entire book content online.

  3. Visualizing Missing Data in SAS https://blogs.sas.com/content/iml/2016/04/20/visualize-missing-data-sas.html

  4. Examine patterns of missing data in SAS https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html

  5. More on Visualizing Missing Data (based on John Fox’s Applied Regression book) http://scs.math.yorku.ca/index.php/Visualizing_missing_data - includes SAS code and macros

  6. Potential “solutions” for missing data - Multiple Imputation (MI) and Maximum Likelihood (ML) http://www.theanalysisfactor.com/missing-data-two-recommended-solutions/

  7. Paul Allison’s website, with a discussion of MI versus ML (you might get a warning about an unsafe website; not sure why) https://m.statisticalhorizons.com/?url=https%3A%2F%2Fstatisticalhorizons.com%2Fml-better-than-mi&width=412