R Packages needed for this Rmarkdown and exercises for today

NOTE: The dplyr and tibble packages are all part of the “CORE Tidyverse”, so you can also simply load the tidyverse package. So, when you load library(tidyverse), you simultaneously load ALL of these core packages:

Load the Davis dataset from the carData package

# load the carData package
# we'll work with the Davis dataset
# which is a part of this package

#library(carData)
#data(Davis)

Take a quick look at the Davis dataset

Load the tibble package and use the glimpse() function to take a quick peek at the Davis dataset.

The Davis dataset has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data. The 5 variables are:

To learn more about this dataset run help(Davis, package = "carData").

library(tibble)
glimpse(Davis)
## Error in glimpse(Davis): object 'Davis' not found

Compute BMI from measured height and weight

The equation for BMI is

\[BMI = \frac{weight (kg)}{[height (m)]^2}\]

To compute BMI we need to:

  1. convert height in cm to m
  2. then compute BMI

So, let’s add 2 new variables to our dataset using the mutate() function from the dplyr package.

# load dplyr package
# library(dplyr)

Davis2 <- Davis %>%
  mutate(height_m = height/100) %>%
  mutate(bmi = weight / ((height_m)^2))
## Error in mutate(., bmi = weight/((height_m)^2)): could not find function "mutate"

BMI histograms by sex

Let’s try out the ggpubr package and use the ggdensity() function.

Learn more at https://rpkgs.datanovia.com/ggpubr/.

NOTE: Try custom colors with hex colors, https://www.colorbook.io/hexcolors/view/00AFBB.

# try custom colors using hex codes
# see https://www.colorbook.io/hexcolors/view/00AFBB 
library(ggpubr)
ggdensity(Davis2, x = "bmi",
   add = "mean", rug = TRUE,
   color = "sex", fill = "sex",
   palette = c("#D689E8", "#00AFBB"))
## Error in ggdensity(Davis2, x = "bmi", add = "mean", rug = TRUE, color = "sex", : object 'Davis2' not found

Well this looks odd. I’m guessing there is an outlier somewhere. We could open the data and look at it in a viewer, but let’s try to do it with code.

The dplyr package also has an arrange() function. So, let’s sort the data and see if we can spot which case has the really large BMI value.

The default is to arrange (or sort) the rows in ascending order. But we want to see the largest value, so we’ll add desc() to get the descending sorted order.

Davis2 %>%
  arrange(desc(bmi)) %>%
  head()
## Error in arrange(., desc(bmi)): could not find function "arrange"

If I had to guess, it looks like the measured height and weight were flipped for case 12. But for now let’s filter this case out and remake our plot.

So, we’ll use the filter() function also from dplyr package.

Davis3 <- Davis2 %>%
  filter(bmi < 50)
## Error in as.ts(x): object 'Davis2' not found
ggdensity(Davis3, x = "bmi",
   add = "mean", rug = TRUE,
   color = "sex", fill = "sex",
   palette = c("#D689E8", "#00AFBB")) +
  ggtitle("BMI from Measured Weight and Height")
## Error in ggdensity(Davis3, x = "bmi", add = "mean", rug = TRUE, color = "sex", : object 'Davis3' not found

Get summary statistics of weight, height and bmi

Let’s get the mean for weight, height and bmi

Davis3 %>%
  summarise(across(c(weight, height, bmi),
            ~ mean(.x, na.rm = TRUE))
  )
## Error in summarise(., across(c(weight, height, bmi), ~mean(.x, na.rm = TRUE))): could not find function "summarise"

Add group_by() to get the means by sex

Davis3 %>%
  group_by(sex) %>%
  summarise(across(c(weight, height, bmi),
            ~ mean(.x, na.rm = TRUE))
  )
## Error in summarise(., across(c(weight, height, bmi), ~mean(.x, na.rm = TRUE))): could not find function "summarise"

This is easier with get_summary_stats() from rstatix package

Let’s try this again and get more stats.

First use the select() function from dplyr and then get the summary stats.

library(rstatix)

Davis3 %>% 
  select(weight, height, bmi) %>%
  get_summary_stats()
## Error in select(., weight, height, bmi): object 'Davis3' not found

Let’s just get mean and sd (standard deviation) and add group_by() to get the stats by sex. NOTE: Add sex to the select() step.

Davis3 %>% 
  group_by(sex) %>%
  select(sex, weight, height, bmi) %>%
  get_summary_stats(type = "mean_sd")
## Error in group_by(., sex): object 'Davis3' not found

Compare measured vs self-report heights and weights by sex

I’ve often heard a saying that “women weigh less and men are taller on paper”. But let’s take a look at the discrepancies between the directly measures height and weight to the self-reports repwt and repht - overall and by sex.

Davis3 <- Davis3 %>%
  mutate(diff_wt_repwt = weight - repwt) %>%
  mutate(diff_ht_repht = height - repht)
## Error in mutate(., diff_wt_repwt = weight - repwt): object 'Davis3' not found

Now that we’ve computed these differences, let’s look at these differences by sex. Differences < 0 indicate that the self-reported repwt or repht were larger than the measured weight or height.

Let’s keep using the ggpubr package and try the ggboxplot() function and add a reference line, a title and a subtitle using functions from ggplot2 package which is loaded with ggpubr. We’ll also clean up the x-axis and y-axis labels.

ggboxplot(Davis3, x = "sex", y = "diff_wt_repwt",
                color = "sex", 
                palette =c("#D689E8", "#00AFBB"),
                add = "jitter", shape = "sex") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Difference in Weights = weight - repwt",
       subtitle = "diff > 0 when measured weight > reported weight") +
  xlab("Biological Sex") +
  ylab("Difference = Measured Weight - Reported Weight")
## Error in ggboxplot(Davis3, x = "sex", y = "diff_wt_repwt", color = "sex", : object 'Davis3' not found

It looks like for females, their actual weights are larger than their self-reported weights.

Let’s take a look at the differences in the heights.

ggboxplot(Davis3, x = "sex", y = "diff_ht_repht",
                color = "sex", 
                palette =c("#D689E8", "#00AFBB"),
                add = "jitter", shape = "sex") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Difference in Heights = height - repht",
       subtitle = "diff > 0 when measured height > reported height")  +
  xlab("Biological Sex") +
  ylab("Difference = Measured Height - Reported Height")
## Error in ggboxplot(Davis3, x = "sex", y = "diff_ht_repht", color = "sex", : object 'Davis3' not found

From this plot it looks like the measured heights are higher than the self-reported heights for both females and males.

Get summary stats of these differences by sex.

Davis3 %>% 
  group_by(sex) %>%
  select(sex, diff_wt_repwt, diff_ht_repht) %>%
  get_summary_stats(type = "mean_sd")
## Error in group_by(., sex): object 'Davis3' not found