Reviewing the dataset - for Homework 01

In today’s class we’ll continue exploring the issues and problems in the dataset you’ll be working with for Homework 01. All of the files are uploaded to this GitHub repository: https://github.com/melindahiggins2000/N736Homework01. Feel free to download the files (as a ZIP), or fork the repository to work with in your own GitHub repo.

Data Quality Assessment and Review

Ranges and Acceptable Numbers

  • Always check to see if the range of the numeric values is reasonable and acceptable
    • 10-point scale - are all 10 levels shown, and do they range correctly (0-9 or perhaps 1-10)?
    • does the scale start at 0? If not, why are there 0’s in your data?
  • are there negative numbers? Should there be?
  • are the numbers evenly distributed? Should they be?
  • should these data be processed numerically or treated as ordinal/categorical levels (e.g. race, income levels, age groups) - does computing the mean make sense? (e.g. Likert scale)
  • are some of the values outside of your expected scale due to missing data or other coding you need to adjust for prior to analysis?
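The range checks above can be sketched in a few lines of Python. This is only an illustration: the `check_scale` helper and the `ratings` values are made up for this example and are not part of the Homework 01 files.

```python
from collections import Counter

def check_scale(values, low, high):
    """Summarize how observed values compare with an expected integer scale."""
    counts = Counter(values)
    expected = set(range(low, high + 1))
    return {
        "min": min(values),
        "max": max(values),
        # values that fall outside the expected scale (possible missing codes or typos)
        "out_of_range": sorted(v for v in counts if v not in expected),
        # scale levels that nobody used
        "unused_levels": sorted(expected - set(counts)),
    }

# Hypothetical ratings on a 1-10 scale; the 0 and the 99 deserve a closer look.
ratings = [1, 3, 3, 5, 7, 10, 0, 99, 8, 2]
report = check_scale(ratings, low=1, high=10)
print(report)
```

An out-of-range value is not automatically an error. It may be a documented missing code, so check the codebook before recoding.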

Missing data

  • check that numeric codes for missing have been treated as such
  • how much is missing per variable
  • how much is missing per subject
  • are there any obvious patterns
    • people at end of study - missingness correlated with time
    • other predictors of missing - depressed subjects more likely not to complete study
    • variables/measures with more missing - sensitive items people refuse to answer, income, risky sexual behaviors, …
  • we’ll explore this more in future sessions
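As a minimal sketch of the first three checks, the Python below recodes numeric missing codes and then counts missing values per variable and per subject. The codes (-9 and 999), the variable names and the records are all assumptions made for illustration; use the codes documented for your own dataset.

```python
# Assumed numeric codes for "missing"; check your codebook for the real ones.
MISSING_CODES = {-9, 999}
VARIABLES = ["age", "income", "depression_score"]  # hypothetical variable names

# Hypothetical records, one dictionary per subject.
records = [
    {"id": 1, "age": 34, "income": 4, "depression_score": 12},
    {"id": 2, "age": 999, "income": -9, "depression_score": 20},
    {"id": 3, "age": 51, "income": None, "depression_score": -9},
]

def recode_missing(value):
    """Return None for blanks and for numeric missing codes, else the value."""
    return None if value is None or value in MISSING_CODES else value

cleaned = [
    {"id": row["id"], **{v: recode_missing(row[v]) for v in VARIABLES}}
    for row in records
]

# How much is missing per variable?
missing_per_variable = {
    v: sum(row[v] is None for row in cleaned) for v in VARIABLES
}
# How much is missing per subject?
missing_per_subject = {
    row["id"]: sum(row[v] is None for v in VARIABLES) for row in cleaned
}

print(missing_per_variable)
print(missing_per_subject)
```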

Outliers

  • assuming a normal distribution, outliers are often defined as values more than 2 or 3 standard deviations from the mean
  • but what if the distribution is NOT normal (income, biomarkers, count data)?
  • extreme values (e.g. LVAD heart transplant cost) - can almost be treated like missing data (imputation); other options include non-parametric methods, bootstrapping and robust statistics
  • illogical values - possible typos
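To see why the standard deviation rule can struggle with extreme values, here is a small sketch that compares it with one robust alternative based on the median and the median absolute deviation (MAD). The MAD rule is my choice of example for "robust statistics", not a method prescribed for the homework, and the `costs` values are invented.

```python
import statistics

def sd_outliers(values, k):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def mad_outliers(values, k=3.0):
    """Flag values more than k scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # rule is undefined when more than half the values are identical
    scale = 1.4826 * mad  # makes the MAD comparable to an SD under normality
    return [v for v in values if abs(v - med) > k * scale]

# Hypothetical costs with one extreme value.
costs = [10, 12, 11, 13, 12, 11, 10, 12, 13, 200]

print(sd_outliers(costs, 2))   # 2 SD rule
print(sd_outliers(costs, 3))   # 3 SD rule
print(mad_outliers(costs))     # median/MAD rule
```

In this example the extreme value inflates the standard deviation so much that the 3 SD rule flags nothing, while the 2 SD rule and the median-based rule both flag 200.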

Odd Distributions; Skewed Distributions

  • mathematical transformations (square root, log, others)
  • distributions expected based on response patterns
  • zero-inflated or hurdle type questions - symptom ratings 0-10, where 0=not present; 1-10 rates severity or frequency (really 2 questions in 1)
  • physical functioning: 0’s indicate cannot perform task, >0 measures how well/how far/how much they did activity (again 2 measures in 1)
  • adherence; knowledge tests - lots of zeros, lots of 100s
  • counts - number of children
  • time to event - length of stay; time to readmit
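Two of the ideas above can be sketched briefly: splitting a "2 questions in 1" item into a present/absent part and a severity part, and applying a log transformation to a right-skewed measure. The data and the `split_two_part` helper are hypothetical, and `log1p` (the log of 1 + x) is just one common choice when zeros may be present.

```python
import math

def split_two_part(scores):
    """Split a 0-10 symptom rating into 'present?' and 'severity if present'."""
    present = [s > 0 for s in scores]
    severity = [s for s in scores if s > 0]
    return present, severity

# Hypothetical symptom ratings: 0 = not present, 1-10 = severity.
symptom = [0, 0, 3, 7, 0, 10, 2, 0]
present, severity = split_two_part(symptom)
prop_present = sum(present) / len(present)

print(prop_present)   # proportion reporting the symptom at all
print(severity)       # severity among those who reported it

# Hypothetical right-skewed length of stay in days.
length_of_stay = [1, 2, 2, 3, 30]
log_los = [math.log1p(days) for days in length_of_stay]
print(log_los)
```

Each part can then be summarized on its own, for example a proportion for the first part and a median for the second.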

Illogical Values

  • look for typos
  • extreme values (biomarkers, costs, income)
  • income - often yearly instead of monthly
  • negative numbers used to indicate missing
  • really high values to indicate missing
  • negative values or odd values occurring after performing date math (e.g. date of event happening before date of birth; date of discharge before date of admission; length of stay in years when you’re expecting days - possibly a year typo, often occurring at the Dec-Jan changeover)
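The date math check can be sketched as follows. The records, the `flag_stay` helper and the 180-day cutoff are assumptions chosen for illustration; pick a cutoff that makes sense for your own study.

```python
from datetime import date

def flag_stay(admit, discharge, max_days=180):
    """Return (length of stay in days, flag) for one admission."""
    los = (discharge - admit).days
    if los < 0:
        return los, "negative"   # discharge recorded before admission
    if los > max_days:
        return los, "too long"   # implausibly long; possible year typo
    return los, "ok"

# Hypothetical stays around the Dec-Jan changeover.
stays = [
    {"id": 1, "admit": date(2016, 12, 28), "discharge": date(2017, 1, 3)},
    {"id": 2, "admit": date(2016, 12, 30), "discharge": date(2016, 1, 2)},
    {"id": 3, "admit": date(2015, 3, 1), "discharge": date(2016, 3, 4)},
]

results = {s["id"]: flag_stay(s["admit"], s["discharge"]) for s in stays}
for subject_id, (los, flag) in results.items():
    print(subject_id, los, flag)
```

Flagged records should be checked against the source documents rather than corrected automatically.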

Mixing of Text (Strings) and Numeric Data

  • always check to see if the data came in as a number if that’s what you’re expecting
  • in SPSS, when switching from string to numeric, watch for values being deleted or replaced with missing values
  • in R check for auto-creation of “factors”
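The same idea applies in any software: convert the text to numbers yourself and look at exactly which entries fail, rather than letting them silently become missing. Here is a minimal Python sketch with made-up values; the `to_number` helper is hypothetical.

```python
def to_number(text):
    """Convert text to a float, or return None if it is not a plain number."""
    try:
        return float(text.strip())
    except ValueError:
        return None

# Hypothetical column that arrived as text.
raw = ["12", " 7.5", "N/A", "", "3,200", "15"]
parsed = [to_number(t) for t in raw]

# Entries that would become missing after conversion; review these by hand.
failures = [t for t, value in zip(raw, parsed) if value is None]

print(parsed)
print(failures)
```

Note that Python's `float` also accepts strings such as "nan" and "inf", so those would need a separate check if they can occur in your data.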

Incorrect Dates and Data Types in General

  • often dates come in as text due to inconsistent formatting
  • issues moving between Excel and other software
  • watch for Excel conversion from date format to number format
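A sketch of how dates stored as text might be cleaned up, including dates that Excel converted to serial numbers. The list of formats is an assumption (in particular, that slashes mean month/day/year), as is the rule that a digits-only string is an Excel serial number in the default 1900 date system. Both should be confirmed against your own files.

```python
from datetime import date, datetime, timedelta

# Assumed text formats; extend this tuple for your own data.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y")
# Day zero for Excel serial numbers in the 1900 date system
# (valid for dates after February 1900).
EXCEL_EPOCH = date(1899, 12, 30)

def parse_date(text):
    """Return a date for known formats or Excel serials, else None."""
    text = text.strip()
    if text.isdigit():
        # Assumption: a digits-only string is an Excel serial number.
        return EXCEL_EPOCH + timedelta(days=int(text))
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

# Hypothetical column with inconsistent date entries.
raw_dates = ["2017-01-15", "01/15/2017", "42750", "Jan 15th"]
parsed_dates = [parse_date(t) for t in raw_dates]
print(parsed_dates)
```

The first three entries all resolve to the same day, and the last one is returned as None so it can be reviewed by hand.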

Copyright © Melinda Higgins, Ph.D. All contents under (CC) BY-NC-SA license, unless otherwise noted.

Feedback, Comments (email me)?