Data Review and Quality Assessment

Ranges and Acceptable Numbers

  • Always check whether the range of the numeric values is reasonable and acceptable
    • 10-point scale - are all 10 levels shown, and how are they coded: is the range 0-9 or perhaps 1-10? [what are the advantages or disadvantages of starting at 0 versus starting at 1?]
    • does the scale start at 0? If not, why are there 0’s in your data?
    • ZEROS are IMPORTANT and should always be investigated and confirmed
  • are there negative numbers, should there be?
  • are the numbers evenly distributed, should they be?
  • should these data be processed numerically or treated as ordinal/categorical levels (e.g. race, income levels, age groups) - does computing the mean make sense? (e.g. Likert scale)
  • are some of the values outside your expected scale because of missing-data codes or other coding that you need to adjust for prior to analysis?
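
The range checks above can be sketched in a few lines of Python. This is only an illustration: the function name `review_scale`, the sample responses, and the documented 1-10 range are all assumptions made for the example, not part of any particular dataset.

```python
from collections import Counter

def review_scale(values, low, high):
    """Summarize a coded scale: observed range, unused levels, out-of-range codes."""
    counts = Counter(values)
    expected = set(range(low, high + 1))
    return {
        "min": min(values),
        "max": max(values),
        # levels that the scale allows but nobody used
        "unused_levels": sorted(expected - set(counts)),
        # codes that fall outside the documented scale (typos? missing codes?)
        "out_of_range": sorted(v for v in counts if v not in expected),
        # zeros deserve their own count so they can be investigated
        "n_zeros": counts[0],
    }

# Hypothetical responses on a scale documented as 1-10
responses = [3, 7, 10, 1, 0, 5, 99, 8, 8, 2]
report = review_scale(responses, low=1, high=10)
print(report)
```

Here the 0 and the 99 are flagged as out of range; whether they are typos, missing-data codes, or a scale that really starts at 0 still has to be confirmed against the codebook.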

Missing data

  • check that numeric codes for missing have been treated as such
  • how much is missing per variable
  • how much is missing per subject
  • are there any obvious patterns
    • people at end of study - missingness correlated with time
    • other predictors of missing - depressed subjects more likely not to complete study
    • variables/measures with more missing - sensitive items people refuse to answer, income, risky sexual behaviors, …
  • we’ll explore this more in future sessions
  • REMEMBER MISSING DATA IS DATA!
  • Pros and Cons of Missing Data imputation
    • bias in the presence of missing data
    • bias introduced by imputation
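
A minimal sketch of the first three checks (recode numeric missing codes, then count missing per variable and per subject). The records, variable names, and the codes -9 and 99 are hypothetical; the real codes should come from the study codebook, and they may differ by variable.

```python
# Assumed missing codes per variable (hypothetical; take these from the codebook)
MISSING_CODES = {"age": {-9}, "income": {99}, "phq": {-9}}

# Hypothetical records: one dict per subject
records = [
    {"id": 1, "age": 34, "income": 99, "phq": 12},
    {"id": 2, "age": -9, "income": 99, "phq": 7},
    {"id": 3, "age": 51, "income": 4, "phq": -9},
]
variables = list(MISSING_CODES)

def recode(variable, value):
    """Turn a numeric missing code into None so it cannot enter a mean or sum."""
    return None if value in MISSING_CODES[variable] else value

clean = [
    {"id": r["id"], **{v: recode(v, r[v]) for v in variables}}
    for r in records
]

# How much is missing per variable, and per subject
missing_per_variable = {v: sum(r[v] is None for r in clean) for v in variables}
missing_per_subject = {r["id"]: sum(r[v] is None for v in variables) for r in clean}

print(missing_per_variable)
print(missing_per_subject)
```

Keeping the codes per variable avoids recoding a legitimate value, for example an age of 99, just because 99 means "missing" on another item.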

Outliers (a.k.a. “extreme” values)

  • assuming a normal distribution, outliers are often defined as values more than 2 or 3 standard deviations from the mean
  • but what if the distribution is NOT normal (income, biomarkers, count data)?
  • extreme values (e.g. LVAD heart transplant cost) - can almost be treated like missing data (imputation); other options are non-parametric methods, bootstrapping, and robust statistics
  • illogical values - possible typos
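
To make the contrast concrete, here is a small sketch comparing the standard-deviation rule with one robust alternative, a median/MAD rule. The MAD rule is a swapped-in example of a "robustized" statistic and not necessarily the one intended above; the cost values, the function names, and the cutoff k are assumptions for illustration.

```python
import statistics

def sd_flags(values, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def mad_flags(values, k=3.0):
    """Flag values more than k scaled MADs from the median.

    1.4826 rescales the MAD so it is comparable to an SD under normality.
    If the MAD is 0 (more than half the values identical), every value that
    differs from the median is flagged, so inspect that case by hand.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scaled = 1.4826 * mad
    return [v for v in values if abs(v - med) > k * scaled]

# Hypothetical costs (in thousands) with one extreme value
costs = [12, 15, 14, 13, 16, 15, 14, 13, 15, 250]

print(sd_flags(costs, k=3.0))
print(sd_flags(costs, k=2.0))
print(mad_flags(costs, k=3.0))
```

In this made-up sample the extreme value inflates the mean and standard deviation so much that the 3 SD rule misses it, while the 2 SD rule and the median-based rule both flag it.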

Odd Distributions; Skewed Distributions

  • mathematical transformations
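    • square root, log, others (useful for right-skewed distributions, i.e. a long right tail)
    • left-skewed distributions: first reflect the values (REF - value, where REF is a reference constant, commonly the maximum + 1), then apply one of the transformations above
    • approaches used to help “normalize” the distribution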
    • pros and cons - nice distribution but complicated to interpret
  • distributions expected based on response patterns
    • possible bimodal or other clusters defined by other factors like gender, race, etc.
  • zero-inflated or hurdle type questions
    • symptom ratings 0-10, where 0=not present;
    • 1-10 rates severity or frequency (really 2 questions in 1)
  • physical functioning:
    • 0’s indicate cannot perform task,
    • >0 measures how well/how far/how much they did activity
    • again 2 measures/questions in 1
  • adherence & knowledge tests
    • lots of zeros, lots of 100s
    • will often have floor or ceiling effects
    • could be bimodal
  • counts (“Poisson” distribution)
    • number of children;
    • days in hospital (length of stay);
    • even Age
  • time to event (“gamma” related distributions)
    • time to disease onset or health event (heart attack, stroke)
    • time to readmit
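
The transformation bullets above can be illustrated with a short sketch: compute skewness before and after a log transform, and reflect a left-skewed variable before transforming it. The data, the function names, and the choice of maximum + 1 as the reflection constant are assumptions for the example.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def reflect(values):
    """Reflect a left-skewed variable: REF - value, with REF assumed to be max + 1."""
    ref = max(values) + 1
    return [ref - v for v in values]

# Hypothetical right-skewed data (e.g. length of stay in days)
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(v) for v in right_skewed]  # requires all values > 0

# Hypothetical left-skewed data (e.g. test scores piling up near 100)
left_skewed = [40, 70, 85, 90, 92, 95, 96, 98, 99, 100]
reflected = reflect(left_skewed)              # now right-skewed, minimum is 1
reflected_logged = [math.log(v) for v in reflected]

print(round(skewness(right_skewed), 2), round(skewness(logged), 2))
print(round(skewness(left_skewed), 2), round(skewness(reflected), 2))
```

Note the interpretation cost mentioned above: after reflection, high transformed values correspond to low original scores, so the direction of every effect flips.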

Illogical Values

  • look for typos (2.6 versus 6.2; 11 instead of 1; 40 instead of 4)
  • unit mismatches (inches instead of cm; feet vs meters; kg versus pounds)
  • extreme values (biomarkers, costs, income)
  • income
    • often yearly instead of monthly
    • be sure common reporting units are used
  • negative numbers used to indicate missing
  • really high values to indicate missing
  • negative or odd values after performing date math (e.g. date of event before date of birth; date of discharge before date of admission; length of stay in years when you’re expecting days) - possibly a year typo, which often occurs around the Dec-Jan changeover
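
The date-math check in the last bullet could look like the sketch below. The admission records, the field names, and the 90-day plausibility ceiling are all hypothetical; a sensible ceiling depends on the setting.

```python
from datetime import date

# Hypothetical admission records
stays = [
    {"id": 1, "admit": date(2023, 3, 1), "discharge": date(2023, 3, 5)},
    # discharge year probably should be 2024 (Dec-Jan changeover typo)
    {"id": 2, "admit": date(2023, 12, 30), "discharge": date(2023, 1, 2)},
    # admission year probably off by one
    {"id": 3, "admit": date(2022, 6, 10), "discharge": date(2023, 6, 12)},
]

def flag_stays(rows, max_days=90):
    """Return {id: (reason, length_of_stay_days)} for implausible stays."""
    flagged = {}
    for row in rows:
        los = (row["discharge"] - row["admit"]).days
        if los < 0:
            flagged[row["id"]] = ("discharge before admission", los)
        elif los > max_days:
            flagged[row["id"]] = ("longer than expected", los)
    return flagged

flagged = flag_stays(stays)
print(flagged)
```

The function only flags records for review; deciding which date is wrong still requires going back to the source.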

Mixing of Text (Strings) and Numeric

  • always check that the data came in as a number if that’s what you’re expecting
    • especially if the imported data accidentally had text somewhere in the column
    • for example, labs often report low values below the detection limit as <x.xx or high values above the saturation limit of the detector as >x.xx. Look out for these text characters before (or after) the number of interest
  • in SPSS, when switching a variable from string to numeric, watch for values being deleted and replaced with missing values
  • in R check for auto-creation of “factors”
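
One way to handle the lab-value case is to split each raw entry into a number and a censoring flag rather than letting the import silently turn it into text or missing. The sketch below is in Python (not SPSS or R); the pattern, the function name `parse_lab`, and the sample strings are assumptions, and it deliberately does not handle negative numbers or scientific notation.

```python
import re

# Optional "<" or ">" before or after a plain decimal number
LAB_PATTERN = re.compile(r"^\s*([<>]?)\s*(\d+(?:\.\d+)?)\s*([<>]?)\s*$")

def parse_lab(raw):
    """Return (value, flag); flag is '<', '>', '' or 'unparsed'."""
    match = LAB_PATTERN.match(str(raw))
    if match is None:
        # Keep a visible marker instead of silently creating a missing value
        return (None, "unparsed")
    lead, number, trail = match.groups()
    return (float(number), lead or trail)

# Hypothetical column as it might arrive from a lab export
raw_column = ["12.4", "<0.05", ">500", " 3.2 ", "n/a", "5.0>"]
parsed = [parse_lab(x) for x in raw_column]
print(parsed)
```

Keeping the flag alongside the number preserves the fact that "<0.05" is a censored value and not a measured 0.05, which matters for any later analysis.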

Incorrect Dates & Data Types in General


Copyright © Melinda Higgins, Ph.D. All contents under a CC BY-NC-SA license unless otherwise noted.
