Text-to-Columns; Search Across Columns; Parse FREE OPEN Text
Author
Melinda Higgins
Published
March 16, 2023
Overview
The code and data presented below will hopefully help you get some experience working with TEXT data in R. The materials below focus on 2 skills with 2 examples for each:
Skill 1: Separating text into columns
Example 1: Get the make and model of cars (mtcars built-in dataset)
Example 2: Extracting “data” from filenames (e.g. id, visit, etc)
Skill 2: Working with FREE/OPEN test - Searching for text - parsing into categories
Example 3: Working with messy list of college courses
arsenal - (optional) for table formatting and organizing output
Skill 1: Splitting text into separate columns
There is a function in EXCEL under the “DATA” tab for “test-to-columns” allowing you to designate a “delimiter” for splitting text chunks into separate columns. There is a similar function in the tidyr package, separate(). Let’s see an example of how this works.
Example 1: Make and Model of Cars in mtcars dataset
Let’s take a look at the built-in mtcars dataset. This dataset has “row names” for each car’s make and model. Here is an example of the top 6 rows of the mtcars dataset:
Show/Hide Code
# view top 6 rows of mtcars datasetmtcars %>%head() %>% knitr::kable(caption ="Top 6 rows of mtcars dataset")
Suppose we now want to break up the make and model into separate columns using the space as our column divider. We can use the separate() function from tidyr package to do this. Note: given the full list of makes and models some have 2 spaces so you’ll end up with 3 columns that we’ll call “make”, “model” and “type” which is why into = c("make", "model", "type") in the code below. This defines the new columns we are adding to the dataset.
The options below are as follows:
data = name of data frame
col = column you want to separate apart (in this case, character)
sep = character expression to match for separating
into = the list of new column variables you want to create
remove = whether you want to keep or remove the rest of the variables in the data frame.
Numeric variables can also be separated, see more details in the help manual for tidyr::separate().
Here is a small hypothetical dataset from a lab that created custom IDs to track the subject, visit number and year by combining them into one long “string” (text field) separated by underscores “_“. This is the variable idlong in the labdata dataset (created in code below).
Using the code example above, here is another application of the tifyr::separate() function to separate the long string idlong into 3 new columns added to the labdata dataset individually for “ID”, “visit” and “year”.
# read in datasetlibrary(readr)school_courses <-read_csv("school_courses.csv")# view dataset in browser with DT package# adds scroll bars and "next" page tabbinglibrary(DT)datatable(school_courses, options =list(pageLength =5, autoWidth =TRUE))
Let’s create indicators for different course categories using sets of keywords under each course type. For example, let’s build indicators for:
English
Writing
Composition
Literature
Critical Thinking
Written Expression
Creative Arts
Communication
Literary
Rhetoric
reading
written communication
Statistics
Quantitative Reasoning
biostatistics
statistics
Fitness - this does NOT include “nutrition” nor “nutrition for wellness”
Health and Fitness
Wellness
Physical Education
Nutrition - run as a separate category
nutrition
nutrition for wellness
Explaining the code below
mutate() from dplyr package used to create new variables in dataset
if_any() also from dplyr package used to select multiple columns “across” which to “apply” a given function. See “colwise” vignette for dplyr.
.cols = is a list of columns or variables
starts_with() is a “helpful” function from tidyselect package, loaded with tidyr.
.fns = could be any function like mean(), but here I’m using a purrr style ~ to “map” a function across the columns specified; see more details for dplyr::across().
tolower() is a base R function that sets the character string specified to all lowercase letters. The syntax here tolower(.) takes the “strings” coming in from the “course” columns and feeds them . into tolower().
Show/Hide Code
# load stringr for str_detect() functionlibrary(stringr)# look across all of the columns that start with "course"# look for the word "english" in any of these columns# to avoid capitalization issues, use tolower() functionschool_courses <- school_courses %>%mutate(englishYN =if_any(.cols =starts_with("course"),.fns =~str_detect(tolower(.), "english")))# add another course to listschool_courses <- school_courses %>%mutate(writingYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "writing")))school_courses %>%mutate(aa =rowSums(across(c(englishYN,writingYN)),na.rm =TRUE)) %>%select(school, englishYN, writingYN, aa)
# A tibble: 10 × 4
school englishYN writingYN aa
<dbl> <lgl> <lgl> <dbl>
1 1 FALSE FALSE 0
2 2 TRUE NA 1
3 3 NA TRUE 1
4 4 TRUE NA 1
5 5 TRUE TRUE 2
6 6 TRUE NA 1
7 7 TRUE NA 1
8 8 TRUE NA 1
9 9 TRUE NA 1
10 10 TRUE NA 1
Show/Hide Code
# create indicator variable for any school# with either an "english" or "writing" course or bothschool_courses <- school_courses %>%mutate(engwrit01 =as.numeric(rowSums(across(c(englishYN,writingYN)),na.rm =TRUE) >0)) school_courses %>%select(school, englishYN, writingYN, engwrit01)
# A tibble: 10 × 4
school englishYN writingYN engwrit01
<dbl> <lgl> <lgl> <dbl>
1 1 FALSE FALSE 0
2 2 TRUE NA 1
3 3 NA TRUE 1
4 4 TRUE NA 1
5 5 TRUE TRUE 1
6 6 TRUE NA 1
7 7 TRUE NA 1
8 8 TRUE NA 1
9 9 TRUE NA 1
10 10 TRUE NA 1
Notice that:
School 1 has something listed in all 10 course listings and none have the word “english” in them, so you get a value of FALSE or 0.
But School 3 only has data in 12 columns, the last 6 are empty. None of these 12 columns had the word “english” and was also missing data in the last columns which is why you get a value of NA.
And the rest of the schools have at least 1 column with the word “english” in it.
There are similar results for the “writing” courses.
The final column shows a 1 if the school has either “english”, “writing” or both or shows a 0 if they have neither.
WARNING
I should note that when I wrote this code I did not care if there was more than 1 course with a given subject (like English 101 and English 102), I only cared whether the course showed up at least once in the list. You may need to update my code if you care about accounting for columns with missing data.
Rest of code to parse rest of list for “English” and “Statistics”
Show/Hide Code
# create TRUE FALSE for YES/NO for each of these key words# and phases to look for:school_courses <- school_courses %>%mutate(englishYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "english"))) %>%mutate(writingYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "writing"))) %>%mutate(compositionYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "composition"))) %>%mutate(literatureYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "literature"))) %>%mutate(criticalThinkingYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "critical thinking"))) %>%mutate(writtenExcourseessionYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "written excourseession"))) %>%mutate(creativeArtsYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "creative arts"))) %>%mutate(communicationYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "communication"))) %>%mutate(literaryYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "literary"))) %>%mutate(rhetoricYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "rhetoric"))) %>%mutate(readingYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "reading"))) %>%mutate(writtenCommunicationYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "written communication"))) %>%# now add up all the TRUE (as 1) and FALSE (as 0)# notice I added na.rm=TRUE so the NAs are ignored# and I used as.numeric(xxx > 0) to mutate(english01 =as.numeric(rowSums(across(c( englishYN, writingYN, compositionYN, literatureYN, criticalThinkingYN, writtenExcourseessionYN, creativeArtsYN, communicationYN, literaryYN, rhetoricYN, readingYN, writtenCommunicationYN ) ),na.rm =TRUE) >0)) %>%mutate(statisticsYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "statistics"))) %>%mutate(biostatisticsYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "biostatistics"))) %>%mutate(quantitativeYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "quantitative"))) %>%mutate(quantitativeReasoningYN =if_any(.cols =starts_with("course"),~str_detect(tolower(.), "quantitative reasoning"))) %>%mutate(stat01 =as.numeric(rowSums(across(c( statisticsYN, biostatisticsYN, quantitativeYN, quantitativeReasoningYN ) ),na.rm =TRUE) >0))
What about “Wellness” versus “Nutrition and Wellness”?
^ to match the start of the string
(?=.*wellness) the string should start with something with “wellness” in it
(?!.*nutrition for wellness) but should NOT have “nutrition for wellness”
library(arsenal)# add labels for variables in c1attr(c1$school, 'label') <-'School ID'attr(c1$engwrit, 'label') <-'English or Writing'attr(c1$english, 'label') <-'English'attr(c1$stat, 'label') <-'Statistics'attr(c1$fitness, 'label') <-'Health, Fitness, Wellness & Physical Education'attr(c1$nutrition, 'label') <-'Nutrition (including Nutrition and Wellness)'# create a function to make 0/1 into "no"/"yes" factor# set 0=no, 1=yes and make as factorfactoryn <-function(.x) {return(factor(.x,level =c(0, 1),label =c("no", "yes"))) }# use purrr package to map this function# across all of the 0/1 variables# to turn them into "no"/"yes" factor typec1yn <- c1 %>%select(all_of(coursenames)) %>% purrr::map(factoryn) %>%data.frame()tab1 <-tableby( ~ .,numeric.stats =c("median", "q1q3", "range", "Nmiss"),data = c1yn)summary( tab1,test =FALSE,pfootnote =TRUE,digits =1,digits.pct =1,title ="Course Frequencies for 10 Schools")
Course Frequencies for 10 Schools
Overall (N=10)
engwrit
no
1 (10.0%)
yes
9 (90.0%)
english
no
0 (0.0%)
yes
10 (100.0%)
stat
no
5 (50.0%)
yes
5 (50.0%)
fitness
no
10 (100.0%)
yes
0 (0.0%)
nutrition
no
5 (50.0%)
yes
5 (50.0%)
Example 4: Parsing a list of medications into treatment classes