Data Cleaning and Assessment

Reviewing the dataset - for Homework 01

In today’s class we’ll continue exploring the issues and problems in the dataset you’ll be working with for Homework 01. All of the files are uploaded to this GitHub repository: https://github.com/melindahiggins2000/N736Homework01. Feel free to download the files (as a ZIP), or fork the repository to work with them in your own GitHub repo.

Data Quality Assessment and Review

Ranges and Acceptable Numbers

Always check whether the range of the numeric values is reasonable and acceptable
- 10 point scale - are all 10 levels shown, and do they range correctly from 0-9 or perhaps 1-10?
- does the scale start at 0? if not, why are there 0’s in your data?
are there negative numbers? should there be?
are the numbers evenly distributed? should they be?
should these data be processed numerically or treated as ordinal/categorical levels (e.g. race, income levels, age groups)? does computing the mean make sense (e.g. for a Likert scale)?
are some of the values outside of your expected scale due to missing-data codes or other coding you need to adjust for prior to analysis?
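
As a rough illustration of the range checks above, here is a minimal Python sketch. The scores, the 1-10 range and the missing-data codes (99 and -9) are all made up for the example; substitute whatever your codebook says.

```python
# Hypothetical responses on a 1-10 scale; 99 and -9 are assumed missing-data codes.
scores = [3, 7, 10, 0, 99, 5, -9, 1, 8, 11]

VALID_MIN, VALID_MAX = 1, 10      # assumed documented range of the scale
MISSING_CODES = {99, -9}          # assumed codes for "missing"

def classify(value):
    """Label a value as 'missing', 'out_of_range', or 'ok'."""
    if value in MISSING_CODES:
        return "missing"
    if not VALID_MIN <= value <= VALID_MAX:
        return "out_of_range"
    return "ok"

labels = [classify(v) for v in scores]
valid = [v for v, lab in zip(scores, labels) if lab == "ok"]
flagged = [v for v, lab in zip(scores, labels) if lab == "out_of_range"]

# Which levels of the scale never appear among the valid responses?
unused_levels = sorted(set(range(VALID_MIN, VALID_MAX + 1)) - set(valid))

print("valid range observed:", min(valid), "to", max(valid))
print("out-of-range values to investigate:", flagged)
print("scale levels never used:", unused_levels)
```

Note that the missing-data codes are set aside before the range is checked; otherwise 99 and -9 would be reported as the maximum and minimum of the scale.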

Missing data

check that numeric codes for missing have been treated as such
how much is missing per variable
how much is missing per subject
are there any obvious patterns
- people at end of study - missingness correlated with time
- other predictors of missingness - e.g. depressed subjects more likely not to complete the study
- variables/measures with more missing - sensitive items people refuse to answer, income, risky sexual behaviors, …
we’ll explore this more in future sessions
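
The per-variable and per-subject counts listed above can be sketched in a few lines of Python. The records and variable names below are hypothetical, and `None` is assumed to mark a value already recoded as missing.

```python
# Hypothetical records: one dict per subject; None marks a missing value.
records = [
    {"id": 1, "age": 34,   "income": 52000, "depression": 12},
    {"id": 2, "age": 51,   "income": None,  "depression": 30},
    {"id": 3, "age": None, "income": None,  "depression": None},
    {"id": 4, "age": 45,   "income": 61000, "depression": 8},
]
variables = ["age", "income", "depression"]

# Proportion missing for each variable (column-wise).
missing_by_variable = {
    var: sum(r[var] is None for r in records) / len(records)
    for var in variables
}

# Number of missing variables for each subject (row-wise).
missing_by_subject = {
    r["id"]: sum(r[var] is None for var in variables)
    for r in records
}

print(missing_by_variable)
print(missing_by_subject)
```

In this toy example income is missing most often, and subject 3 is missing everything, which is the kind of pattern worth a closer look.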

Outliers

assuming a normal distribution, outliers are often defined as values more than 2 or 3 standard deviations from the mean
but what if the distribution is NOT normal (income, biomarkers, count data)?
extreme values (LVAD heart transplant cost) - can almost be treated like missing data (imputation) - other options: non-parametrics, bootstrapping, robust statistics
illogical values - possible typos
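
To see why the standard deviation rule can struggle with skewed data, here is a small Python sketch comparing it with a quartile-based rule (Tukey’s fences, 1.5 x IQR). The IQR rule is not part of these notes; it is included only as one common alternative, and the cost values are invented.

```python
import statistics

# Hypothetical costs (in thousands) with one extreme value.
costs = [10, 12, 11, 13, 12, 14, 11, 13, 12, 200]

def sd_outliers(values, k):
    """Values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

sd2 = sd_outliers(costs, 2)
sd3 = sd_outliers(costs, 3)
iqr_flagged = iqr_outliers(costs)

print("outside 2 SD:", sd2)
print("outside 3 SD:", sd3)
print("outside 1.5 x IQR:", iqr_flagged)
```

In this example the single extreme value inflates the mean and standard deviation enough that the 3 SD rule does not flag it, while the 2 SD rule and the quartile-based rule do.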

Odd Distributions; Skewed Distributions

mathematical transformations (square root, log, others)
distributions expected based on response patterns
zero-inflated or hurdle type questions - symptom ratings 0-10, where 0=not present; 1-10 rates severity or frequency (really 2 questions in 1)
physical functioning: 0’s indicate cannot perform task, >0 measures how well/how far/how much they did activity (again 2 measures in 1)
adherence; knowledge tests - lots of zero’s, lots of 100’s
counts - number of children
time to event - length of stay; time to readmit
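
One way to picture the “2 questions in 1” idea and a simple transformation is the Python sketch below. The ratings and lengths of stay are hypothetical, and using `log1p` (the log of 1 + x) is just one possible choice when zeros are present.

```python
import math

# Hypothetical symptom ratings: 0 = symptom not present, 1-10 = severity.
ratings = [0, 0, 3, 0, 7, 1, 0, 10, 4, 0]

# Question 1: is the symptom present at all? (binary indicator)
present = [r > 0 for r in ratings]
proportion_present = sum(present) / len(ratings)

# Question 2: how severe is it, only among subjects who report the symptom?
severity_if_present = [r for r in ratings if r > 0]
mean_severity = sum(severity_if_present) / len(severity_if_present)

# Hypothetical right-skewed lengths of stay (days); log1p handles zeros safely.
length_of_stay = [0, 1, 2, 2, 3, 5, 8, 40]
log_los = [math.log1p(d) for d in length_of_stay]

print("proportion with symptom:", proportion_present)
print("mean severity when present:", mean_severity)
print("log-transformed length of stay:", [round(x, 2) for x in log_los])
```

The overall mean of the ratings (2.5 here) mixes the two questions together; reporting the proportion present and the severity among those with the symptom keeps them separate.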

Illogical Values

look for typos
extreme values (biomarkers, costs, income)
income - often yearly instead of monthly
negative numbers used to indicate missing
really high values to indicate missing
negative or odd values occurring after performing date math (e.g. date of event before date of birth; date of discharge before date of admission; length of stay in years when you’re expecting days - possibly a year typo, often occurring at the Dec-Jan changeover)
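
The date math checks can be sketched in Python as follows. The admission records and the 120-day plausibility limit are assumptions made up for the example; the right limit depends on your study.

```python
from datetime import date

# Hypothetical admissions; record B has a suspected Dec-Jan year typo.
stays = [
    {"id": "A", "admit": date(2015, 3, 2),   "discharge": date(2015, 3, 9)},
    {"id": "B", "admit": date(2015, 12, 29), "discharge": date(2015, 1, 3)},
    {"id": "C", "admit": date(2014, 6, 1),   "discharge": date(2016, 6, 4)},
]

MAX_PLAUSIBLE_DAYS = 120   # assumed upper limit; set from subject knowledge

problems = {}
for s in stays:
    los = (s["discharge"] - s["admit"]).days   # length of stay in days
    if los < 0:
        problems[s["id"]] = f"negative length of stay ({los} days)"
    elif los > MAX_PLAUSIBLE_DAYS:
        problems[s["id"]] = f"implausibly long stay ({los} days)"

for subject_id, message in problems.items():
    print(subject_id, "-", message)
```

Record B looks like a discharge in early January entered with the old year, and record C looks like a stay of a few days with the wrong discharge year. Both need to be checked against the source records rather than corrected by guessing.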

mixing of text (strings) and numeric

always check to see if the data came in as a number if that’s what you’re expecting
in SPSS, when switching from string to numeric, watch for deleted values and the insertion of missing values
in R check for auto-creation of “factors”
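
The same concern applies in any software: when text is converted to numbers, entries that cannot be converted may quietly become missing. A minimal Python sketch, with a made-up column of weights and a hypothetical helper `to_number`, keeps track of what was lost:

```python
# Hypothetical raw column as read from a file: everything arrives as text.
raw_weights = ["72.5", "80", " 65.1 ", "N/A", "9O", "", "101"]

def to_number(text):
    """Return a float, or None if the text cannot be read as a number."""
    try:
        return float(text.strip())
    except ValueError:
        return None

converted = [to_number(t) for t in raw_weights]

# Entries that became missing during conversion; review these by hand
# rather than letting them silently turn into missing values.
not_numeric = [t for t, v in zip(raw_weights, converted) if v is None]

print("converted:", converted)
print("could not convert:", not_numeric)
```

Here “9O” (a letter O typed instead of a zero) is a likely typo that can be fixed from the source, while “N/A” and the empty entry are probably true missing values.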

incorrect dates and data types in general

often dates come in as text due to inconsistent formatting
issues moving between Excel and other software
watch for Excel conversion from date format to number format
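
As an illustration of both problems, the Python sketch below tries a short list of text formats and also handles dates that arrive as Excel serial numbers. The list of formats is an assumption for this example, and the base date of 1899-12-30 is the offset commonly used for Excel’s default 1900 date system (workbooks using the 1904 date system need a different base).

```python
from datetime import date, datetime, timedelta

# Assumed formats seen in the file; extend this list for your own data.
# Note that 01/02/2015 is ambiguous (month/day or day/month): check the source.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]

# Offset commonly used for Excel's default (1900) date system;
# valid for serial numbers from 61 (1 March 1900) onward.
EXCEL_EPOCH = date(1899, 12, 30)

def parse_date(value):
    """Return a date, or None if the value cannot be interpreted."""
    text = str(value).strip()
    # A short all-digit value is assumed to be an Excel serial number.
    if text.isdigit() and 61 <= int(text) <= 99999:
        return EXCEL_EPOCH + timedelta(days=int(text))
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

raw_dates = ["2015-01-01", "01/01/2015", 42005, "Jan 2015"]
parsed = [parse_date(v) for v in raw_dates]

for raw, result in zip(raw_dates, parsed):
    print(repr(raw), "->", result)
```

The first three entries all resolve to 1 January 2015, while “Jan 2015” comes back as `None` because it has no day; values like that should be listed and reviewed rather than guessed.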

Feedback, Comments (email me)?

Data Cleaning and Assessment - Part II