Aims: 
 to practice statistical thinking on graphical summaries 
 to become familiar with ggplot 
 to produce a data analytic report in Rmarkdown


1 Group work

For the exercises in this lab it is expected that you will work in groups of 4 or 5 as it will expose you to different views, ideas and opinions. However, please make an effort to generate your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. This workshop is designed to generate discussion and provide an opportunity to practise communicating statistical concepts and visualising quantitative data. There are no “correct” answers.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”-John W. Tukey

2 Data exercises

2.1 NHANES dataset

The NHANES dataset is a nationally representative survey that combines demographic data, physical examinations, and laboratory measurements to assess the health of the U.S. population. Please refer to the accompanying data dictionary for detailed descriptions of each variable and their sources before beginning your analysis.

  1. What questions (or hypothesis) could you formulate here ?
  2. How would you visualize this data to best illustrate the results ?
  3. Think about what you will expect to see if you are drawing a histogram, scatter plot, boxplot, or barplot.
  4. We will use this interesting dataset to learn how to create graphics using ggplot.

Press the “show code” button on the right to see how we might approach this in R. I recommend that you try… 1. running the lines of code as is without thinking too hard, 2. then think about what each line might be doing, 3. discuss this as a group and ask the tutor for help or feedback and 4. try altering the code and see what happens

2.1.1 Load the R packages that we need

## These packages contain a bunch of useful functions.  Do you know how to
## install them? In RStudio Go to Tools -> Install Packages -> Install from
## Repository.  OR you can install them by install.packages('packageName') You
## only need to install a package ONCE. Once installed, you then need to load
## the package(s) you wish to use in the R session. You load the packages using
## the library() function.

library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2.1.2 Read in the data.

## This data is stored on Grant's website but the same command can be used to
## read in .csv files from your personal computer. Try downloading it locally
## and then reading it in.


NHvars <- read.csv("https://wimr-genomics.vip.sydney.edu.au/BBHE1006/NHVars.csv")

## Look at the data.
dim(NHvars)
## [1] 8153   20
head(NHvars)
##     SEQN RIAGENDR RIDAGEYR BMXWT BMXHT BMXBMI BMXWAIST BPXOSY1 BPXODI1 LBXTC
## 1 130378     Male       43  86.9 179.5   27.0     98.3     135      98   264
## 2 130379     Male       66 101.8 174.2   33.5    114.7     121      84   214
## 3 130380   Female       44  69.4 152.9   29.7     93.5     111      79   187
## 4 130384     Male       43    NA    NA     NA       NA      NA      NA    NA
## 5 130385   Female       65    NA    NA     NA       NA      NA      NA    NA
## 6 130386     Male       34  90.6 173.3   30.2    106.1     110      72   183
##   LBDHDD LBDLDLM LBXGLU LBXIN LBXHSCRP LBXVD3MS LBXTST LBXEST PhysActive
## 1     45     190    113 15.53     1.78     57.3  668.0   24.4        Yes
## 2     60     135     99 19.91     2.03     58.9  437.0   25.2        Yes
## 3     49      90    156 16.33     5.62     37.8   13.8   85.0        Yes
## 4     NA      NA     NA    NA       NA       NA     NA     NA         No
## 5     NA      NA     NA    NA       NA       NA     NA     NA        Yes
## 6     46     111    100 11.38     1.05     95.3  473.0   14.0        Yes
##   AgeBand
## 1   40–59
## 2   60–79
## 3   40–59
## 4   40–59
## 5   60–79
## 6   20–39
str(NHvars)
## 'data.frame':    8153 obs. of  20 variables:
##  $ SEQN      : int  130378 130379 130380 130384 130385 130386 130387 130388 130389 130390 ...
##  $ RIAGENDR  : chr  "Male" "Male" "Female" "Male" ...
##  $ RIDAGEYR  : int  43 66 44 43 65 34 68 27 59 31 ...
##  $ BMXWT     : num  86.9 101.8 69.4 NA NA ...
##  $ BMXHT     : num  180 174 153 NA NA ...
##  $ BMXBMI    : num  27 33.5 29.7 NA NA 30.2 42.6 43.7 28 46 ...
##  $ BMXWAIST  : num  98.3 114.7 93.5 NA NA ...
##  $ BPXOSY1   : int  135 121 111 NA NA 110 143 130 145 113 ...
##  $ BPXODI1   : int  98 84 79 NA NA 72 76 95 76 78 ...
##  $ LBXTC     : int  264 214 187 NA NA 183 203 NA NA 159 ...
##  $ LBDHDD    : int  45 60 49 NA NA 46 42 NA NA 39 ...
##  $ LBDLDLM   : int  190 135 90 NA NA 111 NA NA NA NA ...
##  $ LBXGLU    : int  113 99 156 NA NA 100 NA NA NA NA ...
##  $ LBXIN     : num  15.5 19.9 16.3 NA NA ...
##  $ LBXHSCRP  : num  1.78 2.03 5.62 NA NA 1.05 3.96 NA NA 11.2 ...
##  $ LBXVD3MS  : num  57.3 58.9 37.8 NA NA 95.3 25.1 NA NA 60.9 ...
##  $ LBXTST    : num  668 437 13.8 NA NA 473 18.7 NA NA 51.5 ...
##  $ LBXEST    : num  24.4 25.2 85 NA NA 14 15.4 NA NA 121 ...
##  $ PhysActive: chr  "Yes" "Yes" "Yes" "No" ...
##  $ AgeBand   : chr  "40–59" "60–79" "40–59" "40–59" ...
# These next two lines are optional - I will show you this visualisation in
# class. It highlights that there are many missing values in this dataset.
# library(naniar) vis_miss(NHvars)

# Print an ordered table showing the number of missing values (NAs) for each
# variable
sort(colSums(is.na(NHvars)), decreasing = TRUE)
##    LBDLDLM      LBXIN     LBXGLU      LBXTC     LBDHDD     LBXEST     LBXTST 
##       5001       4975       4824       2436       2436       2397       2384 
##   LBXHSCRP   LBXVD3MS   BMXWAIST    BPXOSY1    BPXODI1     BMXBMI      BMXWT 
##       2332       2328       2134       2030       2030       1918       1905 
##      BMXHT       SEQN   RIAGENDR   RIDAGEYR PhysActive    AgeBand 
##       1891          0          0          0          0          0

2.1.3 Use ggplot to generate different types of plots.

## What different aspects of the data do each of these plots highlight?

## Boxplot
ggplot(NHvars, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot()
## Warning: Removed 1918 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Notice here we get a Warning message about some rows being removed. This is
## because there are some rows that have missing values for BMI and/or
## PhysActive status. We can fix this by filtering the dataset to only keep the
## rows that don't have missing values for these two variables.

## We will create a new object to store the filtered dataset.

NHplot1 <- NHvars %>%
    filter(!is.na(BMXBMI), !is.na(PhysActive))

## Now we can generate the boxplot using the filtered NHplot1 object.

ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot()

ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot() +
    geom_jitter(alpha = 0.5, size = 0.5)

ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot() +
    geom_jitter(alpha = 0.5, size = 0.5) + theme_classic()

## Histogram
ggplot(NHplot1, aes(x = BMXBMI, group = PhysActive, col = PhysActive)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Density plot
ggplot(NHplot1, aes(x = BMXBMI, group = PhysActive, col = PhysActive)) + geom_density()

## Violin plot
ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_violin()

## Bar plots
ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar()

ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar(position = "dodge")

2.1.4 Adding in labels and titles, and setting a theme

ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar(position = "dodge") +
    labs(x = "Age Band", y = "Count", title = "Number of individuals by Age Band and Physical Activity") +
    theme_minimal()

2.2 Task: R Markdown report

  1. Identify a different combination of variables from the NHANES dataset that would be meaningful to visualise.
  2. Adapt the codes provided above to generate some plots for your new set of variables. Note: How did we remove the observations (rows) that contained missing data for our first set of variables?
  3. Provide your data analytics code as well as a summary of your findings in a reproducible report using the R Markdown format. File - New File - RMarkdown gives you a Rmarkdown template that you can add your code to. We will go through this together in class.

2.3 Extension

  1. Use generative AI to obtain the R code needed to perform a statistical test (eg.two-sample t test) assessing the relationship between the variables you used for one of your visualisations.
  2. Add the result of the statistical test to your visualisation.