Data Wrangling

class: center, middle, inverse, title-slide

# Data Wrangling
## Working with Data in R & Introduction to Dplyr
### Melinda Higgins
### Emory University </br> 02/10/2021

---

background-size: 100%
background-image: url(img/data_cowboy.png)
background-position: 70% 70%
class: center, top

# Become a Data Wrangler!

.footnote[
* Illustrations by Allison Horst, RStudio Artist in Residence, <https://github.com/allisonhorst/stats-illustrations>
]

---

# What's in My Dataset?

* `str()` see structure of data
* `head()` and `tail()` functions
* `names()` list variable names in dataset
* numerical data - use `summary()`
* categorical variable or factors - use `table()`

---

# Palmer Penguins Data - Structure

1. Load palmerpenguins package
2. See structure of data

```r
library(palmerpenguins)
str(penguins)
```

```
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
```

---

# Palmer Penguins Data - "Peek" at Data

1. See top and bottoms rows of data
2. Use `head()` function

```r
head(penguins)
```

```
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 6 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>
```

Notice we cannot "see" all of the data - some columns hidden? The data is in a `tibble` data "class"

```r
class(penguins)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

# Look at more rows - bottom 10 rows

Add the number 10 to the `tail()` function.

```r
tail(penguins, 10)
```

```
## # A tibble: 10 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Chinst~ Dream            50.2          18.8              202        3800
##  2 Chinst~ Dream            45.6          19.4              194        3525
##  3 Chinst~ Dream            51.9          19.5              206        3950
##  4 Chinst~ Dream            46.8          16.5              189        3650
##  5 Chinst~ Dream            45.7          17                195        3650
##  6 Chinst~ Dream            55.8          19.8              207        4000
##  7 Chinst~ Dream            43.5          18.1              202        3400
##  8 Chinst~ Dream            49.6          18.2              193        3775
##  9 Chinst~ Dream            50.8          19                210        4100
## 10 Chinst~ Dream            50.2          18.7              198        3775
## # ... with 2 more variables: sex <fct>, year <int>
```

From `help(tail)` page:

```
## S3 method for class 'function'
tail(x, n = 6L, ...)
```
---

# Convert from a `tibble` to a plain `data.frame`

We can convert this to a plain `data.frame` class and then look at the complete data.

```r
penguins.df <- data.frame(penguins)
class(penguins.df)
```

```
## [1] "data.frame"
```

```r
head(penguins.df)
```

```
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex year
## 1  Adelie Torgersen           39.1          18.7               181        3750   male 2007
## 2  Adelie Torgersen           39.5          17.4               186        3800 female 2007
## 3  Adelie Torgersen           40.3          18.0               195        3250 female 2007
## 4  Adelie Torgersen             NA            NA                NA          NA   <NA> 2007
## 5  Adelie Torgersen           36.7          19.3               193        3450 female 2007
## 6  Adelie Torgersen           39.3          20.6               190        3650   male 2007
```

---

# Using `dplyr` - functions are verbs

Let's `pull()` out one variable (column) `bill_length_mm` from the `penguins` dataset and then get the `mean()`.

We will use the pipe operator `%>%` from `dplyr` to chain our programming steps together in a logical order.

```r
penguins %>%
  pull(bill_length_mm) %>%
  mean()
```

```
## [1] NA
```

Try again - add `na.rm=TRUE` option to the `mean()` function find the mean of all of the non-missing (not `NA`) values.

```r
penguins %>%
  pull(bill_length_mm) %>%
  mean(na.rm = TRUE)
```

```
## [1] 43.92193
```

.footnote[
Learn more at [https://dplyr.tidyverse.org/](https://dplyr.tidyverse.org/).
]

---

# The Base R way to do the same thing

```r
mean(penguins$bill_length_mm, na.rm = TRUE)
```

```
## [1] 43.92193
```

This is processed similar to the previous slide, but now we have to look from the inside out.

The `$` operator selects the `bill_length_mm` variable in the `penguins` dataset.

This is fed into the `mean()` function as the column vector of data we want to find the mean for.

And then we also add the `na.rm = TRUE` option to the `mean()` function.

.footnote[I like `dplyr` better. But knowing the base R way is good also.]

---

# Choose more than 1 variable with `select()`

Let's get `summary()` statistics for `body_mass_g` and `flipper_length_mm`

```r
penguins %>%
  select(body_mass_g, flipper_length_mm) %>%
  summary()
```

```
##   body_mass_g   flipper_length_mm
##  Min.   :2700   Min.   :172.0    
##  1st Qu.:3550   1st Qu.:190.0    
##  Median :4050   Median :197.0    
##  Mean   :4202   Mean   :200.9    
##  3rd Qu.:4750   3rd Qu.:213.0    
##  Max.   :6300   Max.   :231.0    
##  NA's   :2      NA's   :2
```

The `summary()` function is a useful base R function - I just wish it gave you standard deviation!!!

---

# Another "summary" stat function - Hmisc::describe()

```r
library(Hmisc)

penguins %>%
  select(body_mass_g, flipper_length_mm) %>%
  Hmisc::describe()
```

```
## . 
## 
##  2  Variables      344  Observations
## ------------------------------------------------------------------------------------------------------------------------
## body_mass_g 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      342        2       94        1     4202    911.8     3150     3300     3550     4050     4750     5400     5650 
## 
## lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
## ------------------------------------------------------------------------------------------------------------------------
## flipper_length_mm 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      342        2       55    0.999    200.9    16.03    181.0    185.0    190.0    197.0    213.0    220.9    225.0 
## 
## lowest : 172 174 176 178 179, highest: 226 228 229 230 231
## ------------------------------------------------------------------------------------------------------------------------
```

.footnote[Notice the use of `Hmisc::describe()` - the "describe" function has the same name in both the `Hmisc` and `psych` packages. So we use the syntax `packagename::function()` to avoid confusion.]

---

# Also try psych::describe() for summary stats

```r
library(psych)

penguins %>%
  select(body_mass_g, flipper_length_mm) %>%
  psych::describe()
```

```
##                   vars   n    mean     sd median trimmed    mad  min  max range skew kurtosis    se
## body_mass_g          1 342 4201.75 801.95   4050 4154.01 889.56 2700 6300  3600 0.47    -0.74 43.36
## flipper_length_mm    2 342  200.92  14.06    197  200.34  16.31  172  231    59 0.34    -1.00  0.76
```

.footnote[Notice the use of `psych::describe()` - the "describe" function has the same name in both the `Hmisc` and `psych` packages. So we use the syntax `packagename::function()` to avoid confusion.]

---

# Create custom statistics output

The `summarise()` function from `dplyr` will let you create custom statistical summaries.

```r
penguins %>%
  select(body_mass_g, flipper_length_mm) %>%
  summarise(
    body_mean = mean(body_mass_g, na.rm=TRUE),
    body_sd = sd(body_mass_g, na.rm=TRUE),
    flip_mean = mean(flipper_length_mm, na.rm=TRUE),
    flip_sd = sd(flipper_length_mm, na.rm=TRUE)
  )
```

```
## # A tibble: 1 x 4
##   body_mean body_sd flip_mean flip_sd
##       <dbl>   <dbl>     <dbl>   <dbl>
## 1     4202.    802.      201.    14.1
```

---

# Use `group_by()` to further improve your output

Let's get the same output but now grouped by `species.` Notice we added `species` to the `select()` list of variables.

```r
penguins %>%
  select(body_mass_g, flipper_length_mm, species) %>%
* group_by(species) %>%
  summarise(
    body_mean = mean(body_mass_g, na.rm=TRUE),
    body_sd = sd(body_mass_g, na.rm=TRUE),
    flip_mean = mean(flipper_length_mm, na.rm=TRUE),
    flip_sd = sd(flipper_length_mm, na.rm=TRUE)
  )
```

```
## # A tibble: 3 x 5
##   species   body_mean body_sd flip_mean flip_sd
## * <fct>         <dbl>   <dbl>     <dbl>   <dbl>
## 1 Adelie        3701.    459.      190.    6.54
## 2 Chinstrap     3733.    384.      196.    7.13
## 3 Gentoo        5076.    504.      217.    6.48
```

---

# Use `knitr::kable()` for pretty output table

Use `knitr::kable()` on any table, data.frame, matrix or tibble to get pretty output.

|species   | body_mean|  body_sd| flip_mean|  flip_sd|
|:---------|---------:|--------:|---------:|--------:|
|Adelie    |  3700.662| 458.5661|  189.9536| 6.539457|
|Chinstrap |  3733.088| 384.3351|  195.8235| 7.131894|
|Gentoo    |  5076.016| 504.1162|  217.1870| 6.484976|

---

# `knitr::kable()` - add better column headings

Use `knitr::kable()` on any table, data.frame, matrix or tibble to get pretty output.

```r
penguins %>%
  select(body_mass_g, flipper_length_mm, species) %>%
  group_by(species) %>%
  summarise(
    body_mean = mean(body_mass_g, na.rm=TRUE),
    body_sd = sd(body_mass_g, na.rm=TRUE),
    flip_mean = mean(flipper_length_mm, na.rm=TRUE),
    flip_sd = sd(flipper_length_mm, na.rm=TRUE)
  ) %>%
* knitr::kable(col.names = c("Species",
*                            "Mean Body Mass (g)",
*                            "SD of Body Mass (g)",
*                            "Mean Flipper Length (mm)",
*                            "SD of Flipper Length (mm)"))
```

|Species   | Mean Body Mass (g)| SD of Body Mass (g)| Mean Flipper Length (mm)| SD of Flipper Length (mm)|
|:---------|------------------:|-------------------:|------------------------:|-------------------------:|
|Adelie    |           3700.662|            458.5661|                 189.9536|                  6.539457|
|Chinstrap |           3733.088|            384.3351|                 195.8235|                  7.131894|
|Gentoo    |           5076.016|            504.1162|                 217.1870|                  6.484976|

---

---

# `dplyr::across()`

```r
penguins %>%
  dplyr::group_by(species) %>%
* dplyr::summarize(across(ends_with("mm"), mean, na.rm = TRUE)) %>%
  knitr::kable()
```

|species   | bill_length_mm| bill_depth_mm| flipper_length_mm|
|:---------|--------------:|-------------:|-----------------:|
|Adelie    |       38.79139|      18.34636|          189.9536|
|Chinstrap |       48.83382|      18.42059|          195.8235|
|Gentoo    |       47.50488|      14.98211|          217.1870|

Besides `ends_with()` also explore:

* `starts_with()`
* `contains()`
* `matches()`
* learn more about `tidyselect` package [https://tidyselect.r-lib.org/](https://tidyselect.r-lib.org/) (loaded with `dplyr`)

**NOTE:** The code above added the packagename syntax `dplyr::group_by()` and `dplyr::summarise()` to avoid confusion with `Hmisc::summarize()`.

.footnote[Learn more at [https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome](https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome)]

---

---

# Using `dplyr::filter()` to select cases (rows)

Get summary statistics for the Chinstrap penguins

```r
penguins %>%
* filter(species == "Chinstrap") %>%
  summary()
```

```
##       species         island   bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g       sex    
##  Adelie   : 0   Biscoe   : 0   Min.   :40.90   Min.   :16.40   Min.   :178.0     Min.   :2700   female:34  
##  Chinstrap:68   Dream    :68   1st Qu.:46.35   1st Qu.:17.50   1st Qu.:191.0     1st Qu.:3488   male  :34  
##  Gentoo   : 0   Torgersen: 0   Median :49.55   Median :18.45   Median :196.0     Median :3700              
##                                Mean   :48.83   Mean   :18.42   Mean   :195.8     Mean   :3733              
##                                3rd Qu.:51.08   3rd Qu.:19.40   3rd Qu.:201.0     3rd Qu.:3950              
##                                Max.   :58.00   Max.   :20.80   Max.   :212.0     Max.   :4800              
##       year     
##  Min.   :2007  
##  1st Qu.:2007  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2009  
##  Max.   :2009
```

---

---

# `dplyr::mutate()` to make new variables

The body_mass_g is currently in grams. Let's create a new variable for the mass in kg. Save the updated dataset as `penguins.mod`.

```r
penguins.mod <- penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000)

penguins.mod %>%
  select(body_mass_g, body_mass_kg) %>%
  head()
```

```
## # A tibble: 6 x 2
##   body_mass_g body_mass_kg
##         <int>        <dbl>
## 1        3750         3.75
## 2        3800         3.8 
## 3        3250         3.25
## 4          NA        NA   
## 5        3450         3.45
## 6        3650         3.65
```

---

---

# Making categories or recoding - dplyr::case_when()

.pull-left[

```r
# Use case_when() to create
# three size categories
penguins_mod <- penguins %>%
  mutate(size_bin = case_when(
    body_mass_g > 4500 ~ "large",
    body_mass_g > 3000 & 
      body_mass_g <= 4500 ~ "medium",
    body_mass_g <= 3000 ~ "small"
    )
  )

# Create summary table 
# of size categories
penguins_mod %>% 
  pull(size_bin) %>% 
* table(useNA = "ifany") %>%
  knitr::kable(
    col.names = c("Size Category",
                  "Frequency"))
```

]

.pull-right[

### Table of Body Size Categories

|Size Category | Frequency|
|:-------------|---------:|
|large         |       115|
|medium        |       216|
|small         |        11|
|NA            |         2|

]

.footnote[Notice `useNA = "ifany"` option added to `table()` to print the number of missing `NA`'s]

---

# Quick Table Example - Arsenal Package

```r
library(arsenal)

tab1 <- tableby(island ~ bill_length_mm + bill_depth_mm, 
                data = penguins)
summary(tab1)
```

|                            | Biscoe (N=168)  |  Dream (N=124)  | Torgersen (N=52) |  Total (N=344)  | p value|
|:---------------------------|:---------------:|:---------------:|:----------------:|:---------------:|-------:|
|**bill_length_mm**          |                 |                 |                  |                 | < 0.001|
|&nbsp;&nbsp;&nbsp;N-Miss    |        1        |        0        |        1         |        2        |        |
|&nbsp;&nbsp;&nbsp;Mean (SD) | 45.257 (4.773)  | 44.168 (5.954)  |  38.951 (3.025)  | 43.922 (5.460)  |        |
|&nbsp;&nbsp;&nbsp;Range     | 34.500 - 59.600 | 32.100 - 58.000 | 33.500 - 46.000  | 32.100 - 59.600 |        |
|**bill_depth_mm**           |                 |                 |                  |                 | < 0.001|
|&nbsp;&nbsp;&nbsp;N-Miss    |        1        |        0        |        1         |        2        |        |
|&nbsp;&nbsp;&nbsp;Mean (SD) | 15.875 (1.821)  | 18.344 (1.133)  |  18.429 (1.339)  | 17.151 (1.975)  |        |
|&nbsp;&nbsp;&nbsp;Range     | 13.100 - 21.100 | 15.500 - 21.200 | 15.900 - 21.500  | 13.100 - 21.500 |        |

.footnote[Learn more about the `arsenal` package and `tableby()` function at [https://cran.r-project.org/web/packages/arsenal/vignettes/tableby.html](https://cran.r-project.org/web/packages/arsenal/vignettes/tableby.html).]