Aims:
to practice statistical thinking on graphical summaries
to produce a data analytic report in Rmarkdown
to become familiar with ggplots
Complete one of the introduction to R resources given in the pre-class material at home. Review materials from your introductory statistics course.
For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.
“The greatest value of a picture is when it forces us to notice what we never expected to see.”-John W. Tukey
At the beginning of this lab you will be split up into groups of four or five. This tutorial is designed to generate discussion and provide an opportunity to practise statistical thinking and communicating statistical concepts. There are no “correct” answers. As a group discuss and brainstorm the following:
How do you think succeeding in this course could help your career?
Discuss which researcher you have chosen to write a profile on and what you have found.
For the next exercise we are going to trial using Google Docs as a group workspace. The Week 1 Google Docs link is here. Open the Google Doc that contains your group number and populate the table section corresponding to your group number with notes from your discussions. For whiteboard-style brainstorming/visualisation, use the “Jamboard”. At the top of the Jamboard, scroll to the sheet corresponding to your assigned group. You can use this Jamboard later in this task to summarise your group’s discussions to your peers.
This is a historical dataset given as an example in William Gosset’s 1908 paper on The Probable Error of a Mean. In this paper, Gosset introduced a form of what later became known as Student’s t-distribution. The data used in his paper were actually taken from a table by A. R. Cushny and A. R. Peebles in the Journal of Physiology for 1904, showing the different effects of the optical isomers of hyoscyamine hydrobromide in producing sleep. Gosset describes the data as “the sleep of 10 patients was measured without hypnotic and after treatment (1) with D. hyoscyamine hydrobromide, (2) with L. hyoscyamine hydrobromide.”
The average number of hours’ sleep gained by the use of each drug is provided in the file StudentTSleepData.csv. The interest of this study is to understand the effectiveness of the two treatments.
Press the “code” button on the right to see how we might approach this in R. For most weeks I recommend that you try: 1. running the lines of code as is without thinking too hard, 2. thinking about what each line might be doing, 3. discussing this as a group and asking the tutor for help or feedback, and 4. altering the code to see what happens (mostly in the data wrangling this week). These code snippets are not always complete and are designed to be worked through collaboratively with discussion.
## Load the R packages that we need.
## These contain a bunch of useful functions. Do you know how to install them?
## In RStudio go to Tools -> Install Packages -> Install from Repository.
library(ggplot2)
library(tidyr)
## Read in the data.
## This data is stored on Grant's website but the same command can be used to
## read in .csv files from your personal computer. Try downloading it locally
## and then reading it in. What does row.names = 1 mean? Look at the ugly help
## documentation using help('read.csv') or ?read.csv
sleep <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/StudentTSleepData.csv",
row.names = 1)
## Look at the data.
sleep
## Reshape the data.
## There is a lot to unpack here. First run it and think 'what did this code
## do?'. Now move on and then after you've looked at some plots come back and
## talk about what $, factor, pivot_longer, cols, names_to and values_to
## actually mean. You might like to look at help('pivot_longer') or
## vignette('pivot').
sleep$patient <- factor(1:10)
sleepLong <- pivot_longer(sleep, cols = c("Dextro", "Laevo"), names_to = "treatment",
values_to = "hours")
head(sleepLong)
## Different types of plots.
## What different aspects of the data does each of these plots highlight?
## Boxplot
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot()
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot() +
geom_point()
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot() +
geom_point() + theme_classic()
## Line plot
ggplot(sleepLong, aes(x = treatment, y = hours, group = patient, col = patient)) +
geom_line()
## Density plot
ggplot(sleepLong, aes(x = hours, group = treatment, col = treatment)) + geom_density()
## Violin plot
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_violin() +
geom_point()
## Bar plots
ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity")
ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity",
position = position_dodge())
ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity",
position = position_dodge()) + labs(title = "Sleep treatment", subtitle = "Bar plot",
caption = "(based on data from Table 1 in Student, 1908)") + ylab("hours")
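To make the reshaping step above more concrete, here is a minimal sketch of what pivot_longer does, using a small made-up table rather than the sleep data. The data frame toy, its columns subject, A and B, and the new column names condition and score are all hypothetical and chosen only for illustration.

```r
library(tidyr)

# Made-up wide table: one row per subject, one column per condition.
toy <- data.frame(
  subject = factor(1:3),
  A = c(1.0, 2.0, 3.0),
  B = c(1.5, 2.5, 3.5)
)

# cols      : the columns to stack on top of each other
# names_to  : new column that records which original column a value came from
# values_to : new column that holds the stacked values
toyLong <- pivot_longer(toy, cols = c("A", "B"),
                        names_to = "condition", values_to = "score")

toyLong

# The wide table has one row per subject; the long table has one row per
# subject-condition pair.
dim(toy)
dim(toyLong)
```

The long layout is what ggplot expects when a single aesthetic, such as x = treatment, has to distinguish between groups that were originally stored in separate columns.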
Here is the section taken from the actual paper, accessed through the JSTOR website. Spot the typo in the table.
Only this section needs to be included in your Module 1 Lab Report to be handed in at the end of Week 3.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease. The data we will be investigating are a subset of the data collected.
Use setwd() to set your working directory to where your data is stored, or load the data directly from the website (https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv).
## Loading data directly from the web
heartData <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv")
## If the data file is in your working directory, you can read it locally instead
heartData <- read.csv("frmgham.csv")
## size of the data set
dim(heartData)
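## Look at the first few rows rather than printing the whole data set
head(heartData)
## What type of object is heartData?
class(heartData)
## What is the structure of heartData?
str(heartData)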
sex <- heartData$SEX
sexC <- as.character(sex)
sexF <- factor(sexC, levels = c(1, 2), labels = c("Men", "Women"))
class(sex)
class(sexC)
class(sexF)
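The three class() calls above show the same information stored in three different ways. As a minimal sketch of why the factor version is usually the most convenient for a categorical variable, the example below uses a short made-up vector of codes in the style of the SEX column. The names codes and sexToy are hypothetical; the labels follow the coding used in the lab code (1 = Men, 2 = Women).

```r
# Made-up codes, not taken from the Framingham data.
codes <- c(1, 2, 2, 1, 2)

# Attach meaningful labels to the numeric codes.
sexToy <- factor(codes, levels = c(1, 2), labels = c("Men", "Women"))

class(codes)
class(sexToy)

# A numeric summary treats the codes as measurements, which is not meaningful
# for a categorical variable.
summary(codes)

# A table of the factor counts how many observations fall in each category.
table(sexToy)
levels(sexToy)
```

Storing a categorical variable as a factor also means that plotting functions treat it as a set of groups rather than as a continuous number.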
Select a single variable and explore it using what you have learned from MATH1X05 or DATA1X01.
head(heartData$SYSBP)
summary(heartData$SYSBP)
table(heartData$SEX)
hist(heartData$SYSBP, prob = TRUE)
boxplot(heartData$SYSBP)
hist(heartData$SYSBP, freq = FALSE, main = "Histogram", ylab = "Density", col = "green")
boxplot(heartData$SYSBP, horizontal = TRUE, col = "red")
boxplot(SYSBP ~ SEX, data = heartData)
library(tidyverse)
## Graphical summary of the variable 'SYSBP'
ggplot(heartData, aes(x = 1, y = SYSBP)) + geom_boxplot()
## Relationship between SYSBP and SEX.
ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_boxplot()
## Adding some colors (SEX is wrapped in factor() so it gets a discrete colour scale)
ggplot(heartData, aes(x = factor(SEX), y = SYSBP, fill = factor(SEX))) + geom_boxplot()
ggplot(heartData, aes(x = factor(SEX), y = SYSBP, col = factor(SEX))) + geom_boxplot()
ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_violin()
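Side-by-side boxplots are often easier to interpret when paired with numerical summaries for each group. The following is a minimal sketch of one way to do this in base R. To keep it self-contained it uses simulated values in a data frame called fake (a hypothetical name); the numbers are randomly generated and are not Framingham data. Only the column names SEX and SYSBP and the 1/2 coding are borrowed from the lab code.

```r
set.seed(1)

# Simulated stand-in for heartData: the values are made up.
fake <- data.frame(
  SEX = rep(c(1, 2), each = 50),
  SYSBP = round(rnorm(100, mean = 130, sd = 20))
)

# Label the groups so that summaries and plots are easier to read.
fake$sexF <- factor(fake$SEX, levels = c(1, 2), labels = c("Men", "Women"))

# Median and interquartile range of SYSBP within each group.
medians <- tapply(fake$SYSBP, fake$sexF, median)
iqrs <- tapply(fake$SYSBP, fake$sexF, IQR)
medians
iqrs

# The same comparison as a labelled boxplot.
boxplot(SYSBP ~ sexF, data = fake,
        xlab = "Sex", ylab = "Systolic blood pressure")
```

With the real data, replacing fake with heartData (after creating the labelled factor) should give summaries that can be quoted alongside the plots in your report.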
Provide your data analytics code as well as summarising your findings in a reproducible report (e.g. Rmarkdown report).