class: center, middle, inverse, title-slide # Data Wrangling ## Working with Data in R & Introduction to Dplyr ### Melinda Higgins ### Emory University 02/10/2021 --- background-size: 100% background-image: url(img/data_cowboy.png) background-position: 70% 70% class: center, top # Become a Data Wrangler! .footnote[ * Illustrations by Allison Horst, RStudio Artist in Residence, <https://github.com/allisonhorst/stats-illustrations> ] --- # What's in My Dataset? * `str()` see structure of data * `head()` and `tail()` functions * `names()` list variable names in dataset * numerical data - use `summary()` * categorical variable or factors - use `table()` --- # Palmer Penguins Data - Structure 1. Load palmerpenguins package 2. See structure of data ```r library(palmerpenguins) str(penguins) ``` ``` ## tibble [344 x 8] (S3: tbl_df/tbl/data.frame) ## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ... ## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ... ## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ... ## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ... ## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ... ## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ... ``` --- # Palmer Penguins Data - "Peek" at Data 1. See top and bottoms rows of data 2. Use `head()` function ```r head(penguins) ``` ``` ## # A tibble: 6 x 8 ## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> ## 1 Adelie Torge~ 39.1 18.7 181 3750 male ## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~ ## 3 Adelie Torge~ 40.3 18 195 3250 fema~ ## 4 Adelie Torge~ NA NA NA NA <NA> ## 5 Adelie Torge~ 36.7 19.3 193 3450 fema~ ## 6 Adelie Torge~ 39.3 20.6 190 3650 male ## # ... with 1 more variable: year <int> ``` Notice we cannot "see" all of the data - some columns hidden? The data is in a `tibble` data "class" ```r class(penguins) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` --- # Look at more rows - bottom 10 rows Add the number 10 to the `tail()` function. ```r tail(penguins, 10) ``` ``` ## # A tibble: 10 x 8 ## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Chinst~ Dream 50.2 18.8 202 3800 ## 2 Chinst~ Dream 45.6 19.4 194 3525 ## 3 Chinst~ Dream 51.9 19.5 206 3950 ## 4 Chinst~ Dream 46.8 16.5 189 3650 ## 5 Chinst~ Dream 45.7 17 195 3650 ## 6 Chinst~ Dream 55.8 19.8 207 4000 ## 7 Chinst~ Dream 43.5 18.1 202 3400 ## 8 Chinst~ Dream 49.6 18.2 193 3775 ## 9 Chinst~ Dream 50.8 19 210 4100 ## 10 Chinst~ Dream 50.2 18.7 198 3775 ## # ... with 2 more variables: sex <fct>, year <int> ``` From `help(tail)` page: ``` ## S3 method for class 'function' tail(x, n = 6L, ...) ``` --- # Convert from a `tibble` to a plain `data.frame` We can convert this to a plain `data.frame` class and then look at the complete data. ```r penguins.df <- data.frame(penguins) class(penguins.df) ``` ``` ## [1] "data.frame" ``` ```r head(penguins.df) ``` ``` ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year ## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 ## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 ## 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 ## 4 Adelie Torgersen NA NA NA NA <NA> 2007 ## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 ## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 ``` --- # Using `dplyr` - functions are verbs Let's `pull()` out one variable (column) `bill_length_mm` from the `penguins` dataset and then get the `mean()`. We will use the pipe operator `%>%` from `dplyr` to chain our programming steps together in a logical order. ```r penguins %>% pull(bill_length_mm) %>% mean() ``` ``` ## [1] NA ``` Try again - add `na.rm=TRUE` option to the `mean()` function find the mean of all of the non-missing (not `NA`) values. ```r penguins %>% pull(bill_length_mm) %>% mean(na.rm = TRUE) ``` ``` ## [1] 43.92193 ``` .footnote[ Learn more at [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/). ] --- # The Base R way to do the same thing ```r mean(penguins$bill_length_mm, na.rm = TRUE) ``` ``` ## [1] 43.92193 ``` -- This is processed similar to the previous slide, but now we have to look from the inside out. -- The `$` operator selects the `bill_length_mm` variable in the `penguins` dataset. -- This is fed into the `mean()` function as the column vector of data we want to find the mean for. -- And then we also add the `na.rm = TRUE` option to the `mean()` function. -- .footnote[I like `dplyr` better. But knowing the base R way is good also.] --- # Choose more than 1 variable with `select()` Let's get `summary()` statistics for `body_mass_g` and `flipper_length_mm` ```r penguins %>% select(body_mass_g, flipper_length_mm) %>% summary() ``` ``` ## body_mass_g flipper_length_mm ## Min. :2700 Min. :172.0 ## 1st Qu.:3550 1st Qu.:190.0 ## Median :4050 Median :197.0 ## Mean :4202 Mean :200.9 ## 3rd Qu.:4750 3rd Qu.:213.0 ## Max. :6300 Max. :231.0 ## NA's :2 NA's :2 ``` The `summary()` function is a useful base R function - I just wish it gave you standard deviation!!! --- # Another "summary" stat function - Hmisc::describe() ```r library(Hmisc) penguins %>% select(body_mass_g, flipper_length_mm) %>% Hmisc::describe() ``` ``` ## . ## ## 2 Variables 344 Observations ## ------------------------------------------------------------------------------------------------------------------------ ## body_mass_g ## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95 ## 342 2 94 1 4202 911.8 3150 3300 3550 4050 4750 5400 5650 ## ## lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300 ## ------------------------------------------------------------------------------------------------------------------------ ## flipper_length_mm ## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95 ## 342 2 55 0.999 200.9 16.03 181.0 185.0 190.0 197.0 213.0 220.9 225.0 ## ## lowest : 172 174 176 178 179, highest: 226 228 229 230 231 ## ------------------------------------------------------------------------------------------------------------------------ ``` .footnote[Notice the use of `Hmisc::describe()` - the "describe" function has the same name in both the `Hmisc` and `psych` packages. So we use the syntax `packagename::function()` to avoid confusion.] --- # Also try psych::describe() for summary stats ```r library(psych) penguins %>% select(body_mass_g, flipper_length_mm) %>% psych::describe() ``` ``` ## vars n mean sd median trimmed mad min max range skew kurtosis se ## body_mass_g 1 342 4201.75 801.95 4050 4154.01 889.56 2700 6300 3600 0.47 -0.74 43.36 ## flipper_length_mm 2 342 200.92 14.06 197 200.34 16.31 172 231 59 0.34 -1.00 0.76 ``` .footnote[Notice the use of `psych::describe()` - the "describe" function has the same name in both the `Hmisc` and `psych` packages. So we use the syntax `packagename::function()` to avoid confusion.] --- # Create custom statistics output The `summarise()` function from `dplyr` will let you create custom statistical summaries. ```r penguins %>% select(body_mass_g, flipper_length_mm) %>% summarise( body_mean = mean(body_mass_g, na.rm=TRUE), body_sd = sd(body_mass_g, na.rm=TRUE), flip_mean = mean(flipper_length_mm, na.rm=TRUE), flip_sd = sd(flipper_length_mm, na.rm=TRUE) ) ``` ``` ## # A tibble: 1 x 4 ## body_mean body_sd flip_mean flip_sd ## <dbl> <dbl> <dbl> <dbl> ## 1 4202. 802. 201. 14.1 ``` --- # Use `group_by()` to further improve your output Let's get the same output but now grouped by `species.` Notice we added `species` to the `select()` list of variables. ```r penguins %>% select(body_mass_g, flipper_length_mm, species) %>% * group_by(species) %>% summarise( body_mean = mean(body_mass_g, na.rm=TRUE), body_sd = sd(body_mass_g, na.rm=TRUE), flip_mean = mean(flipper_length_mm, na.rm=TRUE), flip_sd = sd(flipper_length_mm, na.rm=TRUE) ) ``` ``` ## # A tibble: 3 x 5 ## species body_mean body_sd flip_mean flip_sd ## * <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Adelie 3701. 459. 190. 6.54 ## 2 Chinstrap 3733. 384. 196. 7.13 ## 3 Gentoo 5076. 504. 217. 6.48 ``` --- # Use `knitr::kable()` for pretty output table Use `knitr::kable()` on any table, data.frame, matrix or tibble to get pretty output. ```r penguins %>% select(body_mass_g, flipper_length_mm, species) %>% group_by(species) %>% summarise( body_mean = mean(body_mass_g, na.rm=TRUE), body_sd = sd(body_mass_g, na.rm=TRUE), flip_mean = mean(flipper_length_mm, na.rm=TRUE), flip_sd = sd(flipper_length_mm, na.rm=TRUE) ) %>% * knitr::kable() ``` |species | body_mean| body_sd| flip_mean| flip_sd| |:---------|---------:|--------:|---------:|--------:| |Adelie | 3700.662| 458.5661| 189.9536| 6.539457| |Chinstrap | 3733.088| 384.3351| 195.8235| 7.131894| |Gentoo | 5076.016| 504.1162| 217.1870| 6.484976| --- # `knitr::kable()` - add better column headings Use `knitr::kable()` on any table, data.frame, matrix or tibble to get pretty output. ```r penguins %>% select(body_mass_g, flipper_length_mm, species) %>% group_by(species) %>% summarise( body_mean = mean(body_mass_g, na.rm=TRUE), body_sd = sd(body_mass_g, na.rm=TRUE), flip_mean = mean(flipper_length_mm, na.rm=TRUE), flip_sd = sd(flipper_length_mm, na.rm=TRUE) ) %>% * knitr::kable(col.names = c("Species", * "Mean Body Mass (g)", * "SD of Body Mass (g)", * "Mean Flipper Length (mm)", * "SD of Flipper Length (mm)")) ``` |Species | Mean Body Mass (g)| SD of Body Mass (g)| Mean Flipper Length (mm)| SD of Flipper Length (mm)| |:---------|------------------:|-------------------:|------------------------:|-------------------------:| |Adelie | 3700.662| 458.5661| 189.9536| 6.539457| |Chinstrap | 3733.088| 384.3351| 195.8235| 7.131894| |Gentoo | 5076.016| 504.1162| 217.1870| 6.484976| --- <img src="img/dplyr_across.png" width="=100%" /> --- # `dplyr::across()` ```r penguins %>% dplyr::group_by(species) %>% * dplyr::summarize(across(ends_with("mm"), mean, na.rm = TRUE)) %>% knitr::kable() ``` |species | bill_length_mm| bill_depth_mm| flipper_length_mm| |:---------|--------------:|-------------:|-----------------:| |Adelie | 38.79139| 18.34636| 189.9536| |Chinstrap | 48.83382| 18.42059| 195.8235| |Gentoo | 47.50488| 14.98211| 217.1870| -- Besides `ends_with()` also explore: * `starts_with()` * `contains()` * `matches()` * learn more about `tidyselect` package [https://tidyselect.r-lib.org/](https://tidyselect.r-lib.org/) (loaded with `dplyr`) **NOTE:** The code above added the packagename syntax `dplyr::group_by()` and `dplyr::summarise()` to avoid confusion with `Hmisc::summarize()`. .footnote[Learn more at [https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome)] --- <img src="img/dplyr_filter.jpg" width="=100%" /> --- # Using `dplyr::filter()` to select cases (rows) Get summary statistics for the Chinstrap penguins ```r penguins %>% * filter(species == "Chinstrap") %>% summary() ``` ``` ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex ## Adelie : 0 Biscoe : 0 Min. :40.90 Min. :16.40 Min. :178.0 Min. :2700 female:34 ## Chinstrap:68 Dream :68 1st Qu.:46.35 1st Qu.:17.50 1st Qu.:191.0 1st Qu.:3488 male :34 ## Gentoo : 0 Torgersen: 0 Median :49.55 Median :18.45 Median :196.0 Median :3700 ## Mean :48.83 Mean :18.42 Mean :195.8 Mean :3733 ## 3rd Qu.:51.08 3rd Qu.:19.40 3rd Qu.:201.0 3rd Qu.:3950 ## Max. :58.00 Max. :20.80 Max. :212.0 Max. :4800 ## year ## Min. :2007 ## 1st Qu.:2007 ## Median :2008 ## Mean :2008 ## 3rd Qu.:2009 ## Max. :2009 ``` --- <img src="img/dplyr_mutate.png" width="75%" /> --- # `dplyr::mutate()` to make new variables The body_mass_g is currently in grams. Let's create a new variable for the mass in kg. Save the updated dataset as `penguins.mod`. ```r penguins.mod <- penguins %>% mutate(body_mass_kg = body_mass_g / 1000) penguins.mod %>% select(body_mass_g, body_mass_kg) %>% head() ``` ``` ## # A tibble: 6 x 2 ## body_mass_g body_mass_kg ## <int> <dbl> ## 1 3750 3.75 ## 2 3800 3.8 ## 3 3250 3.25 ## 4 NA NA ## 5 3450 3.45 ## 6 3650 3.65 ``` --- <img src="img/dplyr_case_when.png" width="75%" /> --- # Making categories or recoding - dplyr::case_when() .pull-left[ ```r # Use case_when() to create # three size categories penguins_mod <- penguins %>% mutate(size_bin = case_when( body_mass_g > 4500 ~ "large", body_mass_g > 3000 & body_mass_g <= 4500 ~ "medium", body_mass_g <= 3000 ~ "small" ) ) # Create summary table # of size categories penguins_mod %>% pull(size_bin) %>% * table(useNA = "ifany") %>% knitr::kable( col.names = c("Size Category", "Frequency")) ``` ] .pull-right[ ### Table of Body Size Categories |Size Category | Frequency| |:-------------|---------:| |large | 115| |medium | 216| |small | 11| |NA | 2| ] .footnote[Notice `useNA = "ifany"` option added to `table()` to print the number of missing `NA`'s] --- # Quick Table Example - Arsenal Package ```r library(arsenal) tab1 <- tableby(island ~ bill_length_mm + bill_depth_mm, data = penguins) summary(tab1) ``` | | Biscoe (N=168) | Dream (N=124) | Torgersen (N=52) | Total (N=344) | p value| |:---------------------------|:---------------:|:---------------:|:----------------:|:---------------:|-------:| |**bill_length_mm** | | | | | < 0.001| | N-Miss | 1 | 0 | 1 | 2 | | | Mean (SD) | 45.257 (4.773) | 44.168 (5.954) | 38.951 (3.025) | 43.922 (5.460) | | | Range | 34.500 - 59.600 | 32.100 - 58.000 | 33.500 - 46.000 | 32.100 - 59.600 | | |**bill_depth_mm** | | | | | < 0.001| | N-Miss | 1 | 0 | 1 | 2 | | | Mean (SD) | 15.875 (1.821) | 18.344 (1.133) | 18.429 (1.339) | 17.151 (1.975) | | | Range | 13.100 - 21.100 | 15.500 - 21.200 | 15.900 - 21.500 | 13.100 - 21.500 | | .footnote[Learn more about the `arsenal` package and `tableby()` function at [https://cran.r-project.org/web/packages/arsenal/vignettes/tableby.html](https://cran.r-project.org/web/packages/arsenal/vignettes/tableby.html).]