Aims:
to practice statistical thinking on graphical summaries
to become familiar with ggplot
to produce a data analytic report in Rmarkdown
For the exercises in this lab it is expected that you will work in groups of 4 or 5 as it will expose you to different views, ideas and opinions. However, please make an effort to generate your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. This workshop is designed to generate discussion and provide an opportunity to practise communicating statistical concepts and visualising quantitative data. There are no “correct” answers.
“The greatest value of a picture is when it forces us to notice what we never expected to see.”-John W. Tukey
The NHANES dataset is a nationally representative survey that combines demographic data, physical examinations, and laboratory measurements to assess the health of the U.S. population. Please refer to the accompanying data dictionary for detailed descriptions of each variable and their sources before beginning your analysis.
Press the “show code” button on the right to see how we might approach this in R. I recommend that you try… 1. running the lines of code as is without thinking too hard, 2. then think about what each line might be doing, 3. discuss this as a group and ask the tutor for help or feedback and 4. try altering the code and see what happens
## These packages contain a bunch of useful functions. Do you know how to
## install them? In RStudio Go to Tools -> Install Packages -> Install from
## Repository. OR you can install them by install.packages('packageName') You
## only need to install a package ONCE. Once installed, you then need to load
## the package(s) you wish to use in the R session. You load the packages using
## the library() function.
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## This data is stored on Grant's website but the same command can be used to
## read in .csv files from your personal computer. Try downloading it locally
## and then reading it in.
NHvars <- read.csv("https://wimr-genomics.vip.sydney.edu.au/BBHE1006/NHVars.csv")
## Look at the data.
dim(NHvars)
## [1] 8153 20
head(NHvars)
## SEQN RIAGENDR RIDAGEYR BMXWT BMXHT BMXBMI BMXWAIST BPXOSY1 BPXODI1 LBXTC
## 1 130378 Male 43 86.9 179.5 27.0 98.3 135 98 264
## 2 130379 Male 66 101.8 174.2 33.5 114.7 121 84 214
## 3 130380 Female 44 69.4 152.9 29.7 93.5 111 79 187
## 4 130384 Male 43 NA NA NA NA NA NA NA
## 5 130385 Female 65 NA NA NA NA NA NA NA
## 6 130386 Male 34 90.6 173.3 30.2 106.1 110 72 183
## LBDHDD LBDLDLM LBXGLU LBXIN LBXHSCRP LBXVD3MS LBXTST LBXEST PhysActive
## 1 45 190 113 15.53 1.78 57.3 668.0 24.4 Yes
## 2 60 135 99 19.91 2.03 58.9 437.0 25.2 Yes
## 3 49 90 156 16.33 5.62 37.8 13.8 85.0 Yes
## 4 NA NA NA NA NA NA NA NA No
## 5 NA NA NA NA NA NA NA NA Yes
## 6 46 111 100 11.38 1.05 95.3 473.0 14.0 Yes
## AgeBand
## 1 40–59
## 2 60–79
## 3 40–59
## 4 40–59
## 5 60–79
## 6 20–39
str(NHvars)
## 'data.frame': 8153 obs. of 20 variables:
## $ SEQN : int 130378 130379 130380 130384 130385 130386 130387 130388 130389 130390 ...
## $ RIAGENDR : chr "Male" "Male" "Female" "Male" ...
## $ RIDAGEYR : int 43 66 44 43 65 34 68 27 59 31 ...
## $ BMXWT : num 86.9 101.8 69.4 NA NA ...
## $ BMXHT : num 180 174 153 NA NA ...
## $ BMXBMI : num 27 33.5 29.7 NA NA 30.2 42.6 43.7 28 46 ...
## $ BMXWAIST : num 98.3 114.7 93.5 NA NA ...
## $ BPXOSY1 : int 135 121 111 NA NA 110 143 130 145 113 ...
## $ BPXODI1 : int 98 84 79 NA NA 72 76 95 76 78 ...
## $ LBXTC : int 264 214 187 NA NA 183 203 NA NA 159 ...
## $ LBDHDD : int 45 60 49 NA NA 46 42 NA NA 39 ...
## $ LBDLDLM : int 190 135 90 NA NA 111 NA NA NA NA ...
## $ LBXGLU : int 113 99 156 NA NA 100 NA NA NA NA ...
## $ LBXIN : num 15.5 19.9 16.3 NA NA ...
## $ LBXHSCRP : num 1.78 2.03 5.62 NA NA 1.05 3.96 NA NA 11.2 ...
## $ LBXVD3MS : num 57.3 58.9 37.8 NA NA 95.3 25.1 NA NA 60.9 ...
## $ LBXTST : num 668 437 13.8 NA NA 473 18.7 NA NA 51.5 ...
## $ LBXEST : num 24.4 25.2 85 NA NA 14 15.4 NA NA 121 ...
## $ PhysActive: chr "Yes" "Yes" "Yes" "No" ...
## $ AgeBand : chr "40–59" "60–79" "40–59" "40–59" ...
# These next two lines are optional - I will show you this visualisation in
# class. It highlights that there are many missing values in this dataset.
# library(naniar) vis_miss(NHvars)
# Print an ordered table showing the number of missing values (NAs) for each
# variable
sort(colSums(is.na(NHvars)), decreasing = TRUE)
## LBDLDLM LBXIN LBXGLU LBXTC LBDHDD LBXEST LBXTST
## 5001 4975 4824 2436 2436 2397 2384
## LBXHSCRP LBXVD3MS BMXWAIST BPXOSY1 BPXODI1 BMXBMI BMXWT
## 2332 2328 2134 2030 2030 1918 1905
## BMXHT SEQN RIAGENDR RIDAGEYR PhysActive AgeBand
## 1891 0 0 0 0 0
## What different aspects of the data do each of these plots highlight?
## Boxplot
ggplot(NHvars, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot()
## Warning: Removed 1918 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Notice here we get a Warning message about some rows being removed. This is
## because there are some rows that have missing values for BMI and/or
## PhysActive status. We can fix this by filtering the dataset to only keep the
## rows that don't have missing values for these two variables.
## We will create a new object to store the filtered dataset.
NHplot1 <- NHvars %>%
filter(!is.na(BMXBMI), !is.na(PhysActive))
## Now we can generate the boxplot using the filtered NHplot1 object.
ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot()
ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot() +
geom_jitter(alpha = 0.5, size = 0.5)
ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_boxplot() +
geom_jitter(alpha = 0.5, size = 0.5) + theme_classic()
## Histogram
ggplot(NHplot1, aes(x = BMXBMI, group = PhysActive, col = PhysActive)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Density plot
ggplot(NHplot1, aes(x = BMXBMI, group = PhysActive, col = PhysActive)) + geom_density()
## Violin plot
ggplot(NHplot1, aes(x = PhysActive, y = BMXBMI, col = PhysActive)) + geom_violin()
## Bar plots
ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar()
ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar(position = "dodge")
ggplot(NHplot1, aes(x = AgeBand, fill = PhysActive)) + geom_bar(position = "dodge") +
labs(x = "Age Band", y = "Count", title = "Number of individuals by Age Band and Physical Activity") +
theme_minimal()